Preface


Bayesian analysis provides a unified and coherent way of thinking about decision 
problems and how to solve them using data and other information. The goal of this 
book is to acquaint the reader in a serious way with this approach and its problem- 
solving potential, and to this end it has two objectives. The first is to provide 
a clear understanding of Bayesian analysis, grounded in the theory of inference 
and optimal decisionmaking, which will enable the reader to confidently analyze 
real problems. The second is to equip the reader with state-of-the-art simulation 
methods that can be used to solve these problems. 

This book is written for research professionals who use econometrics and similar 
statistical methods in their work, and for Ph.D. students in disciplines that do the 
same. These disciplines include economics and statistics, as well as the many social 
sciences and fields in business and public policy schools that study decisionmaking 
on the basis of data and other information. The book assumes the same knowledge 
of mathematical statistics as most Ph.D. courses in econometrics, and familiarity 
with linear models at the level of a graduate applied econometrics course or a 
master’s statistics course. The entire book was developed through a decade of 
teaching at this level, all of the material having been presented at least twice 
and some more than a half-dozen times. This vetting process has afforded the 
opportunity to minimize the barriers to entry to a sound and practical grasp of 
Bayesian analysis for the intended audience. 

Loosely speaking, the first three chapters address the objective of a clear under- 
standing of Bayesian analysis—how to think—and the next five, the objective 
of presenting and applying simulation methods—how to act. There is no sharp 
distinction between these two objectives. In particular, as one gains greater confi- 
dence with “hands on” methods, it is natural to rethink the formulation of problems 
at hand with the knowledge that what was not long ago impossible is now prac- 
tical. The text has many examples and exercises that follow this path, ranging 
from questions that have been used in examinations to substantial projects that 
extend or apply the methods presented. Some of these examples and exercises use 
the Bayesian analysis, computation, and communication (BACC) extension of the 
mathematical applications Matlab, Splus, R, and Gauss. The reader will find the
software and documentation, along with data and code for examples, in the online
appendix for this text at http://www.biz.uiowa.edu/cbes.

The book takes up specific models as vehicles for understanding Bayesian anal- 
ysis and applying simulation methods. This entails solving problems in a practical 
way and at the level of detail required by research professionals whose work must 
withstand subsequent scrutiny. In some cases these solutions did not exist only a 
few years ago (prior to 2005), and are not yet widely known among econometri- 
cians and statisticians. Therefore the book concentrates on a handful of models in 
some depth, rather than attempting to survey all models with a scope similar to 
that of leading (and much longer) graduate econometrics texts. The coverage here 
should not be taken as a judgment that other models are somehow less important or 
significant, or cannot be approached using Bayesian analysis. Just the opposite is 
true. The approaches and methods in this book are being used to improve models 
and decisionmaking at an accelerating rate, as perusal of the tables of contents of 
leading journals such as the Journal of the American Statistical Association, the 
Journal of the Royal Statistical Society, and the Journal of Econometrics will ver- 
ify. The reader of this book will be well equipped to understand this research, to 
appreciate its relevance to problems at hand, and to tailor existing methods to these 
problems. 

The organization is designed to meet a variety of uses in graduate education. All 
begin with Chapter 1, which provides an overview of the rest of the text at a lower 
technical level than is used subsequently. This material, which can be covered 
in 1–2 weeks in a traditional setting or in the first day of an intensive course,
provides the reader with motivation for the more technical work that follows. A 
full-year graduate course can cover the first four chapters in the first semester, 
perhaps using the material on discrete-state Markov processes in Chapter 7 as an 
entrée to the theory of Markov chain Monte Carlo (MCMC) methods in Chapter 4. 
The second semester then begins with hands-on computing and applications and 
proceeds through the rest of the book. One can base a one-semester course on 
Chapters 1 and 2, the first three sections of Chapter 4, Section 5.1, plus other parts
of Chapters 5, 6, and 7 as time and interests dictate. For example, completion 
of Chapter 5 will concentrate on linear models. Chapter 6 concentrates on latent 
variable models, and for this concentration the material on hierarchical priors at the 
start of Chapter 3 may also be of interest. An intensive applications-oriented course 
of 1–2 weeks can be based on Chapter 1, Section 2.1, Section 4.3, and Section
5.1, plus other parts of Chapters 5, 6, and 7 consistent with time and interests. The 
online appendix provides ample material for computing laboratory sessions in such 
a course. 

I am very grateful to a number of people who contributed, in one way or 
another, to the book. Scores of graduate students were involved since the mid- 
1990s as material was developed, discarded, modified, and redeveloped in graduate 
courses at the Universities of Minnesota and Iowa. Of these former graduate or 
postdoctoral students, Gianni Amisano, Pat Bajari, Hulya Eraslan, Merrell Hora, 
John Landon-Lane, Lea Petrella, Arnie Quinn, Hisashi Tanizaki, and Nobuhiko 
Terui all played roles in improving the text, computing code, or examples. I owe 
a special debt to my former student Bill McCausland, who also conceived the
BACC software and brought it into being. I am grateful to the National Science 
Foundation for support of software development and research incorporated here. 
For nurturing many aspects of the Bayesian approach to thinking reflected in these 
pages, I am especially grateful to Jim Berger, Jay Kadane, Dennis Lindley, Dale 
Poirier, Christopher Sims, Luke Tierney, and Arnold Zellner. Finally, for advice 
and comments on many specific aspects of the book I thank Siddhartha Chib, 
Bill Griffiths, Gary Koop, Peter Rossi, Christopher Sims, Mark Steel, and Herman 
van Dijk. 


JOHN GEWEKE 


Iowa City, Iowa 


CHAPTER 1 


Introduction 


The evolution of modern society is driven by decisions that affect the welfare 
and choices of large groups of individuals. Of the scores of examples, a few will 
illustrate the characteristics of decisionmaking that motivate our approach: 


1. A new drug has been developed in the laboratories of a private firm over 
a period of several years and at a cost of tens of millions of dollars. It has 
been tested in animals, and in increasingly larger groups of human beings 
in a succession of highly structured clinical trials. If the drug is approved 
by the Food and Drug Administration (FDA), it will be available for all 
licensed physicians to use at their discretion. The FDA must decide whether 
to approve the drug. 


2. Since the mid-1980s evidence from many different sources, taken together, 
clearly indicates that the earth’s climate is warming. The evidence that this 
warming is due to human activities, in particular the emission of carbon 
dioxide, is not as compelling but becomes stronger every year. The economic 
activities responsible for increases in the emission of carbon dioxide are 
critical to the aspirations of billions of people, and to the political order that 
would be needed to sustain a policy that would limit emissions. How should 
the evidence be presented to political leaders who are able to make and 
enforce decisions about emissions policy? What should their decision be? 


3. A multi-billion-dollar firm is seeking to buy a firm of similar size. The 
two firms have documented cost reductions that will be possible because 
of the merger. On the other hand, joint ownership of the two firms will 
likely increase market power, making it in the interests of the merged firm 
to set higher price-cost margins than did the two firms separately. How
should lawyers and economists—whether disinterested or not—document 
and synthesize the evidence on both points for the regulatory authorities who 
decide whether to permit the merger? How should the regulatory authorities 
make their decision? If they deny the merger, the firms must decide whether
to appeal the decision to the courts. 


4. A standard petroleum refining procedure produces two-thirds unleaded gasoline
and one-third heating oil (or jet aviation fuel, its near equivalent).
Refinery management buys crude oil, and produces and sells gasoline and 
heating oil. The wholesale prices of these products are volatile. Management 
can guarantee the difference between selling and buying prices, by means 
of futures contracts in which speculators (risk takers) commit to purchasing 
specified amounts of gasoline or heating oil, and selling agreed-on amounts 
of crude oil, at fixed prices. Should management lock in some or all of its 
net return in this way? If some, then how much? 


These decisions differ in many ways. The second and third will appear promi- 
nently in the media; the first might, the last rarely will. The second is a matter 
of urgent global public policy, and the last is entirely private. The other two are 
mixtures; in each case the final decision is a matter of public policy, but in both the 
matter is raised to the level of public policy through a sequence of private decisions, 
in which anticipation of the ultimate public policy decision is quite important. 

Yet these decisions have many features in common: 


1. The decision must be made on the basis of less-than-perfect information.
By “perfect information” is meant all the information the decisionmaker(s) 
would requisition if information were free, that is, immediately available at 
no cost in resources diverted from other uses. 


2. The decision must be made at a specified time. Either waiting is prohibited
by law or regulation (examples 1 and 3), is denied by the definition of the
decision (example 4), or “wait” amounts to making a critical choice that may 
circumscribe future options (example 2). 


3. The information bearing on the decision, and the consequences of the decision,
are primarily quantitative. The relationship between information and
outcome, mediated by working hypotheses about the connection between the 
two, is nondeterministic. 


4. There are multiple sources of information bearing on each decision. Whether
the information is highly structured and derived from controlled experiments
(example 1), consists of numerous studies using different approaches and 
likely reaching different conclusions (examples 2 and 3), or originates in 
different time periods and settings whose relation to the decision at hand must 
be assessed repeatedly (example 4), this information must be aggregated, 
explicitly or implicitly, in the decision. 


We will often refer to “investigators” and “clients,” terms due to Hildreth (1963). 
The investigator is the applied statistician or econometrician whose function is to 
convey quantitative information in a manner that facilitates and thereby improves 
decisions. The client may be the actual decisionmaker, or—more often—another 
scientist working to support the decision with information. The client’s identity 
and preferences may be well known to the investigator (example: an expert witness
hired by any interested party), or many clients may be unknown to the investigator 
(example: the readers of a subsequently well-cited academic paper reporting the 
investigator’s work). 

The objective of this book is to provide investigators with understanding and 
technical tools that will enable them to communicate effectively with clients, includ- 
ing decisionmakers and other investigators. Several themes emerge: 


1. Make all assumptions explicit. 
2. Explicitly quantify all of the essentials, including the assumptions. 


3. Synthesize, or provide the means to synthesize, different approaches and 
models. 


4. Represent the inevitable uncertainty in ways that will be useful to the client. 


The understanding of effective communication is grounded in Bayesian inference 
and decision theory. The grounding emerges not from any single high-minded 
principle, but rather from the fact that this foundation is by far the most coherent 
and comprehensive one that presently exists. It may eventually be superseded by 
a superior model, but for the foreseeable future it is the foundation of economics 
and rational quantitative decisionmaking. 

The reader grounded in non-Bayesian methods need not take any of this for 
granted. To these readers, the utility of the approach taken here will emerge as 
successive real problems succumb to effective treatment using Bayesian methods, 
while remaining considerably more difficult, if not entirely intractable, using non- 
Bayesian approaches. 

Simulation methods provide an indispensable link between principles and prac- 
tice. These methods, essentially unavailable before the late 1980s, represent uncer- 
tainty in terms of a large but finite number of synthetic random drawings from the 
distribution of unobservables (examples: parameters and latent variables), condi- 
tional on what is known (examples: data and the constraints imposed by economic 
theory) and the model(s) used to relate unobservables to what is known. Algorithms 
for the generation of the synthetic random drawings are governed by this represen- 
tation of uncertainty. The investigator who masters these tools not only becomes a 
more fluent communicator of results but also greatly expands the choices of con- 
texts, or models, in which to represent uncertainty and provide useful information 
to decisionmakers. 


1.1 TWO EXAMPLES 


This chapter is an overview of the chapters that follow. It provides much of what is 
needed for the reader to be a knowledgeable client, that is, a receiver of information 
communicated in the way just discussed. Being an effective investigator requires 
the considerably more detailed and technical understanding that the other chapters 
convey. 


1.1.1 Public School Class Sizes


The determination of class size in public schools is a political and fiscal decision 
whose details vary from state to state and district to district. Regardless of the 
details, the decision ultimately made balances the fact that, given the number of 
students in the district, a lower student : teacher ratio is more costly, against 
the perception that a lower student : teacher ratio also increases the quality of 
education. Moreover, quality is difficult to measure. The most readily available 
measures are test scores. Changes made in federal funding of locally controlled 
public education since 2001 emphasize test scores as indicators of quality, and 
create fiscal incentives for local school boards to maintain and improve the test 
scores of students in their districts. 

In this environment, there are several issues that decisionmaking clients must 
address and in which Bayesian investigation is important: 


1. What is the relationship between the student : teacher ratio and test scores? 
Quite a few other factors, all of them measurable, may also affect test scores. 
We are uncertain about how to model the relationship, and for any one model 
there is uncertainty about the parameters in this model. Even if we were 
certain of both the model and the parameters, there would still be uncertainty 
about the resulting test scores. Full reporting and effective decisionmaking 
require that all these aspects of uncertainty be expressed. 


2. The tradeoff between costs, on one hand, and quality of education, on the 
other hand, needs to be expressed. “Funding formulas” that use test scores 
to determine revenues available to school administrators (the clients) express 
at least part of this relationship quantitatively. In addition, a client may wish 
to see the implications of alternative valuations of educational quality, as 
expressed in test scores, for decisions about class size. Funding formulas may 
be expressed in terms of targets that make this an analytically challenging 
problem. The simulation methods that are an integral part of contemporary 
Bayesian econometrics and statistics make it practical to solve such problems 
routinely. 


3. Another set of prospective clients consists of elected and appointed poli- 
cymakers who determine funding formulas. Since these policymakers are 
distinct from school administrators, any funding formula anticipates (at least 
implicitly) the way that these administrators will handle tradeoffs between 
the costs of classroom staffing and the incentives created in the funding for- 
mulas. Depending on administrators’ behavior, different policies may incur 
higher, or lower, costs to attain the same outcome as measured by test scores. 


Bayesian analysis provides a coherent and practical framework for combining 
information and data in a useful way in this and other decisionmaking situations. 
Chapters 2 and 3 take up the critical technical steps in integrating data and other 
sources of information and representing the values of the decisionmaking client. 
Chapter 4 provides the simulation methods that make it practical and routine to
undertake the required analysis. The remaining chapters return to this particular
decision problem at several points. 


1.1.2 Value at Risk 


Financial institutions (banks, brokerage firms, insurance companies) own a variety 
of financial assets, often with total value in the many billions of dollars. They may 
include debt issued by businesses, loans to individuals, and government bonds. 
These firms also have financial liabilities: for example, deposit accounts in the 
case of private banks and life insurance policies in the case of insurance companies. 
Taken together, the holdings of financial assets or liabilities by a firm are known 
as its “portfolio.” 

The value of an institution’s portfolio, or of a particular part of it, is constantly 
changing. This is the case even if the institution initiates no change in its holdings, 
because the market prices of the institution’s assets or liabilities change from day to
day and even minute to minute. Thus every such institution is involved in a risky 
business. In general, the larger the institution, the more difficult it is to assess this 
risk because of both the large variety of assets and liabilities and the number of 
individuals within the institution who have authority to change specified holdings 
in the institution’s portfolio. 

Beginning about 1990 financial institutions, and government agencies with 
oversight and regulatory responsibility for these institutions, developed measures 
of the risk inherent in institutions’ portfolios. One of the simplest and most widely 
used is value at risk. To convey the essentials of the idea, let $p_t$ be the market
value of an institution’s entire portfolio, or of a defined portion of it. In the former
case, $p_t$ is the net worth of the institution: what would remain in the hypothetical
situation that the institution were to sell all its assets and meet all of its liabilities.
In the latter case it might be (for example) the institution’s holding of conventional 
mortgages, or of U.S. government bonds. 

The value $p_t$ is constantly changing. This is in part a consequence of changes in holdings
by the institution, but it is also a result of changes in market prices. Value at risk
is more concerned with the latter, so $p_t$ is taken to be the portfolio value assuming
that its composition remains fixed. “Value at risk” is defined with respect to a future
time period, say, $t^*$, relative to the current period $t$, where $t^* > t$ and $t^* - t$ may
range from less than a day to as long as a month. A typical definition of value at
risk is that it is the loss in portfolio value $v_{t,t^*}$ that satisfies


$$P(p_t - p_{t^*} \geq v_{t,t^*}) = .05. \tag{1.1}$$


Thus value at risk is a hypothetical decline in value, such that the probability of an 
even greater decline is 5%. The choice of .05 appears arbitrary, since other values 
could be used, but .05 is by far the most common, and in fact some regulatory 
authorities establish limits of $v_{t,t^*}$ in relation to $p_t$ based on (1.1).

The precise notion of probability in (1.1) is important. Models for establishing 
value at risk provide a distribution for $p_{t^*}$, conditional on $p_t$ and, perhaps, other
information available at time $t$. From this distribution we can then determine $v_{t,t^*}$.
Most models used for this purpose are formulated in terms of the period-to-period 
return on the portfolio 


$$r_t = (p_t - p_{t-1})/p_{t-1},$$


and statistical modeling usually directly addresses the behavior of the time series 


$$y_t = \log(1 + r_t) = \log(p_t/p_{t-1}). \tag{1.2}$$


One of the simplest models is


$$y_t \sim N(\mu, \sigma^2). \tag{1.3}$$


Even this simple model leaves open a number of questions. For example, is it really 
intended that the same model (including the same mean and variance) pertains today 
for “high tech” stocks as it did in 1999, before the rapid decline in their value? In 
any event, the parameters $\mu$ and $\sigma^2$ are unknown, so how is this fact to be handled
in the context of (1.1)? This problem is especially vexing if $\mu$ and $\sigma^2$ are subject
to periodic changes, as the high-tech example suggests at least sometimes must be 
the case if we insist on proceeding with (1.3). 
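
To make the connection between (1.1) and (1.3) concrete, the following Python sketch computes value at risk in the artificial case where $\mu$ and $\sigma^2$ are treated as known, which is exactly the assumption the preceding paragraph calls into question. The portfolio value, the parameter values, and the function name `value_at_risk` are illustrative assumptions, not quantities taken from the text. The sketch uses the fact that, under (1.3) with independent returns, the sum of log returns over the horizon is normal, so that the loss solving (1.1) is $p_t(1 - e^{q})$ with $q$ the 5% quantile of that sum.

```python
from math import exp, log
from statistics import NormalDist


def value_at_risk(p_t, mu, sigma, horizon=1, prob=0.05):
    """Loss v such that P(p_t - p_{t*} >= v) = prob.

    Assumes i.i.d. N(mu, sigma^2) log returns with KNOWN mu and sigma,
    an illustrative simplification of the problem discussed in the text.
    """
    # The sum of `horizon` i.i.d. log returns is N(horizon*mu, horizon*sigma^2).
    total_return = NormalDist(horizon * mu, sigma * horizon ** 0.5)
    q = total_return.inv_cdf(prob)  # lower-tail quantile of the summed log return
    # p_{t*} = p_t * exp(sum of log returns), so the loss at the quantile is:
    return p_t * (1.0 - exp(q))


# Hypothetical inputs, chosen only for illustration.
p_t = 1_000_000_000.0        # current portfolio value
mu, sigma = 0.0002, 0.0135   # assumed daily mean and standard deviation of y_t

v = value_at_risk(p_t, mu, sigma)
print(f"one-day 5% value at risk: {v:,.0f}")
```

Treating $\mu$ and $\sigma$ as fixed inputs is what makes this calculation a one-liner; the remainder of the chapter is concerned with what to do when they are not known.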

One of the biggest difficulties with (1.3) is that it is demonstrably bad as a 
description of returns that are relatively large in absolute value, at least with fixed 
$\mu$ and $\sigma^2$. If we take as the fixed values of $\mu$ and $\sigma^2$ their conventional estimates
based on daily stock price indices for the entire twentieth century, then the model
implies that “crashes” like the one that occurred in October 1987 are events that
are so rare as to be impossible for all practical purposes. [For the daily Standard
and Poor’s 500 stock returns for January 3, 1928–April 30, 1991, from Ryden et al.
(1998) used in Sections 7.3 and 8.3, the mean is .000182, the standard deviation
is .0135, and the largest return in absolute value is -.228, which is 16.9 standard
deviations from the mean. If $z \sim N(0,1)$ then $P(z < -16.9) = 2.25 \times 10^{-64}$. The
inverse of this probability is $4.44 \times 10^{63}$. Dividing by 260 trading days in the year
yields $1.71 \times 10^{61}$ years. The estimated age of the universe is $1.2 \times 10^{10}$ years.
Chapter 8 takes up Bayesian specification analysis, which is the systematic and
constructive assessment of this sort of incongruence of a model with reality.] This,
of course, makes explicit the fact that we are uncertain about more than just the
unknown parameters $\mu$ and $\sigma^2$ in (1.3). In fact we are also uncertain about the
functional form of the distribution, and our notion of “probability” in (1.1) should 
account for this, too. 
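
The arithmetic in the bracketed remark can be reproduced with a few lines of Python. This is only a check of the tail calculation under the standard normal distribution; the function name `normal_lower_tail` is an assumption of this sketch. The complementary error function is used so that the extremely small tail probability is evaluated directly rather than as one minus a number indistinguishable from one.

```python
from math import erfc, sqrt


def normal_lower_tail(z):
    """P(Z < z) for Z ~ N(0, 1), via erfc to retain precision far in the tail."""
    return 0.5 * erfc(-z / sqrt(2.0))


# A return 16.9 standard deviations below the mean, as in the text.
tail_prob = normal_lower_tail(-16.9)

# Expected waiting time for one such day, in trading days and then in years,
# using the text's figure of 260 trading days per year.
expected_days = 1.0 / tail_prob
expected_years = expected_days / 260.0

print(f"P(z < -16.9)       = {tail_prob:.3e}")
print(f"expected wait (yr) = {expected_years:.3e}")
```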

Section 1.4 introduces an alternative to (1.3), which is developed in detail in 
Section 7.3. An important variant on the value at risk problem arises when a 
decisionmaker (say, a vice president of an investment bank) selects the value .05, 
as opposed to some other probability, in (1.1). This integration of behavior with 
probability is the foundation of Bayesian decision theory, as well as of important 
parts of modern economics and finance. We shall return to this theme repeatedly, 
for example, in Sections 2.4 and 4.1. 


1.2 OBSERVABLES, UNOBSERVABLES, AND OBJECTS OF INTEREST


A model is a simplified description of reality that is at least potentially useful in 
decisionmaking. Since models are simplified, they are never literally true; what- 
ever the “data-generating process” may be, it is not the model. Since models are 
constructed for the purpose of decisionmaking, different decision problems can 
appropriately lead to different models despite the fact that the reality they simplify is 
the same. A well-known example is Newtonian physics, which is inadequate when 
applied to cosmology or subatomic interactions but works quite well in launching 
satellites and sending people to the moon. In the development of positron emission 
tomography and other kinds of imaging based on the excitation of subatomic particles,
on the other hand, quantum mechanics (a different model) functions quite well
whereas Newtonian mechanics is inapplicable. 

All scientific models have certain features in common. One is that they often 
reduce an aspect of reality to a few quantitative concepts that are unobservable but 
organize observables in a way that is useful in decisionmaking. The gravitational 
constant or the charge of an electron in physics, and the variance of asset returns or 
the equation of a demand function in the examples in the previous section are all 
examples of unobservables. Observables can be measured directly; the acceleration 
of an object when dropped, the accumulation of charge on an electrode, average 
test scores in different school districts, and sample means of asset returns are all 
examples. 

A model posits certain relationships between observables and unobservables; 
without these relationships the concepts embodied in the unobservables would be 
vacuous. A scientific model takes the form “Given the values of the unobservables, 
the observables will behave in the following way.” The relationship may or may 
not be deterministic. Thus a model may be cast in the form 


$$p(y \mid \theta),$$


in which $\theta$ is a vector of unobservables and $y$ is a vector of observables. The
unobservables $\theta$ are typically parameters or latent variables. It is important to
distinguish between the observables $y$, a random vector, and their values after
they are observed, which we shall denote $y^o$ and which are commonly called “data.”
The functional form of the probability density p gives the model some of its 
content. In the simple example of Section 1.1.1 the observables might be pairs of 
student : teacher ratios and test score averages in a sample of school districts, and 
the unobservables the slope and intercept parameters of a normal linear regression 
model linking the two. In the simple example of Section 1.1.2, the observables 
might be asset returns $y_1, \ldots, y_T$, and the unobservable is $\sigma^2 = \mathrm{var}(y_t)$.

The relationship $p(y \mid \theta)$ between observables and unobservables is central, but
it is not enough for decisionmaking. The relationship between the gravitational 
constant g and the acceleration that results when a force is applied to a mass is 
not enough to deliver a communications satellite into orbit—we had better know 
quite a lot about the value of g. Likewise, in assessing value at risk using the 
simplified model of Section 1.1.2, we must know something about $\sigma^2$. In general,
the density $p(y \mid \theta)$ may restrict the behavior of $y$ regardless of $\theta$ (e.g., when
dropped, everyday objects accelerate at rates that differ negligibly with their mass)
but for decisionmaking we must know something about $\theta$. (An object will fall
about how many meters per second squared at sea level?) A very general way to
represent knowledge about $\theta$ is by means of a density $p(\theta)$. Formally, we may
combine $p(\theta)$ and $p(y \mid \theta)$ to produce information about the observables:


$$p(y) = \int p(\theta)\, p(y \mid \theta)\, d\theta.$$


How we obtain information about $\theta$, and how $p(\theta)$ changes in response to new
information are two of the central topics of this book. In particular, we shall turn
shortly to the question of how information about $\theta$ changes when $y$ is observed.
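
The integral that combines the density for the unobservables with the density for the observables can be approximated by simulation: draw the unobservable from its density, then draw the observable conditional on that value, and discard the first draw. The Python sketch below illustrates this under assumptions that are not in the text: a scalar unobservable with a standard normal density, and an observable that is normal with mean equal to the unobservable and unit variance. In this special case the implied distribution of the observable is known to be normal with mean 0 and variance 2, which gives something to compare the simulation against.

```python
import random
from statistics import fmean, pvariance


def draw_marginal_y(n_draws, seed=0):
    """Draw from p(y) by composition: theta ~ p(theta), then y ~ p(y | theta).

    Both densities are illustrative assumptions:
    p(theta) is N(0, 1) and p(y | theta) is N(theta, 1).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        theta = rng.gauss(0.0, 1.0)   # one draw of the unobservable
        y = rng.gauss(theta, 1.0)     # one draw of the observable given theta
        draws.append(y)               # keeping only y marginalizes over theta
    return draws


y_draws = draw_marginal_y(100_000)
y_mean = fmean(y_draws)
y_var = pvariance(y_draws)
print(f"simulated mean {y_mean:.3f}, simulated variance {y_var:.3f}")
```

With these assumed densities the simulated mean and variance should be close to 0 and 2 respectively; the same two-step recipe applies unchanged when no closed form is available.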

In any decision there is typically more than one model at hand that is at least 
potentially useful. In fact, much of the work of actual decisionmakers lies in sorting 
through and weighing the implications of different models. To recognize this fact, 
we shall further index the relation between observables and unobservables by $A$ to
denote the model: $p(y \mid \theta)$ becomes $p(y \mid \theta_A, A)$, and $p(\theta)$ becomes $p(\theta_A \mid A)$.
The vector of unobservables (in many cases, the parameters of the model $A$) $\theta_A$
belongs to the set $\Theta_A \subseteq \mathbb{R}^{k_A}$. Alternative models will be denoted $A_1, A_2, \ldots$. Note
that the unobservables need not be the same in the models, but the observables
$y \in Y$ are. When several models have the same set of observables, and then we
obtain observations (which we call “data”), it becomes possible to discriminate
among models. We shall return to this topic in Section 1.5, where we will see 
that with a bit more effort we can actually use the data to assign probabilities to 
competing models. 

More generally, however, the models relevant to the decision at hand need not all 
have the same set of observables. A classic example is the work of Friedman (1957) 
on the marginal propensity to consume. One model ($A_1$) used aggregate time series
data on income and consumption, while another model ($A_2$) used income and consumption
measures for different households at the same point in time. The sets of
models addressed the same unobservable, the marginal propensity to consume, but
reached different conclusions. Friedman’s contribution was to show that the models
$A_1$ and $A_2$ did, indeed, have different unobservables ($\theta_{A_1}$ and $\theta_{A_2}$), and that the
differences in $\theta_{A_1}$ and $\theta_{A_2}$ were consistent with a third, more appropriate, concept
of marginal propensity to consume. We shall denote the object of interest on which
decisionmaking depends, and about which all models relevant to the decision have
something to say, by the vector $\omega$. We shall denote the implications of model $A$ for $\omega$
by $p(\omega \mid y, \theta_A, A)$. The models at hand must specify this density; if they do not,
then they are not pertinent to the decision at hand. 

We can apply this idea to the two examples in the previous section. In the 
case of the class size decision, $\omega$ might be a $q \times 1$ vector of average test scores
conditional on $q$ alternative decisions that might be made about class size. In the
case of value at risk, $\omega$ might be a $5 \times 1$ vector, the value of the portfolio at the
end of each of the next 5 business days. 

In summary, we have identified three components of a complete model, $A$,
involving unobservables (often parameters) $\theta_A$, observables $y$, and a vector of
interest $\omega$:


$$p(\theta_A \mid A), \tag{1.4}$$
$$p(y \mid \theta_A, A), \tag{1.5}$$
$$p(\omega \mid y, \theta_A, A). \tag{1.6}$$


The ordering of (1.4)–(1.6) emphasizes the fact that the model $A$ specifies the joint
distribution 


$$p(\theta_A, y, \omega \mid A) = p(\theta_A \mid A)\, p(y \mid \theta_A, A)\, p(\omega \mid y, \theta_A, A). \tag{1.7}$$


It is precisely this joint distribution that makes it possible to use data to inform decisions
in an internally consistent manner, and (with more structure to be introduced
in Section 1.6) addresses the question of which decision would be optimal.
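
The factorization of the joint distribution suggests a direct way to simulate from a complete model: draw the unobservables, then the observables given the unobservables, then the vector of interest given both. The Python sketch below follows that ordering for a stylized version of the value at risk example, with the vector of interest taken to be the portfolio value at the end of each of the next 5 business days. Everything specific in it is an assumption made for illustration: the prior for the mean and standard deviation of returns, the starting portfolio value, the sample size, and the function name `draw_joint`.

```python
import random
from math import exp


def draw_joint(rng, p_0=100.0, n_obs=250, n_ahead=5):
    """One draw of (theta_A, y, omega) from a stylized complete model.

    All distributions below are hypothetical choices for illustration.
    """
    # Step 1, unobservables theta_A = (mu, sigma): an assumed prior.
    mu = rng.gauss(0.0, 0.001)
    sigma = rng.uniform(0.005, 0.03)

    # Step 2, observables given theta_A: n_obs i.i.d. normal log returns.
    y = [rng.gauss(mu, sigma) for _ in range(n_obs)]

    # Step 3, vector of interest given y and theta_A: portfolio values over
    # the next n_ahead days, starting from the value implied by the returns y.
    p = p_0 * exp(sum(y))
    omega = []
    for _ in range(n_ahead):
        p *= exp(rng.gauss(mu, sigma))
        omega.append(p)

    return (mu, sigma), y, omega


rng = random.Random(1)
draws = [draw_joint(rng) for _ in range(1000)]
theta_A, y, omega = draws[0]
print(theta_A, len(y), omega)
```

Note that this simulates from the joint distribution before any data are seen; conditioning on observed data is a separate step.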


Exercise 1.2.1 Conditional Probability. A test for the presence of a disease can 
be administered by a nurse. A result “positive” (+) indicates disease present; a 
result “negative” (-) indicates disease absent. However, the test is not perfect. The
sensitivity of the test is the probability of a “positive” result conditional on the 
disease being present; it is .98. The specificity of the test is the probability of a 
“negative” result conditional on the disease being absent; it is .90. The incidence 
of the disease is the probability that the disease is present in a randomly selected 
individual; it is .005. 

Denoting specificity by $p$, sensitivity by $q$, incidence by $\pi$, and test outcome by
$+$ or $-$, develop an expression for the probability of disease conditional on a
“positive” outcome and one for the probability of disease conditional on a “negative”
outcome, if the test is administered to a randomly selected individual. Evaluate
these expressions using the values given above.


Exercise 1.2.2 Non-Bayesian Statistics. Suppose the model $A$ is $y \sim N(\mu, 1)$,
$\mu \ge 0$, and the sample consists of a single observation $y = y^o$.


(a) Show that $S = (\max(y - 1.96, 0), \max(y + 1.96, 0))$ is a 95% classical
confidence interval for $\mu$, that is, $P(\mu \in S \mid \mu, A) = .95$.

(b) Show that if $y^o = -2.0$ is observed, then the 95% classical confidence
interval is the empty set.


Exercise 1.2.3 Ex Ante and Ex Post Tests. Let $y$ have a uniform distribution
on the interval $(\theta, \theta + 1)$, and suppose that it is desired to test the null hypothesis
$H_0: \theta = 0$ versus the alternative hypothesis $H_1: \theta = 0.9$ (which are the only two
values of $\theta$ that are possible). A single observation $y$ is available. Consider the test
that rejects $H_0$ if $y > 0.95$, and accepts $H_0$ otherwise.


(a) Calculate the probabilities of type I and type II errors for this test.

(b) Explain why it does not make common sense, for decisionmaking purposes,
to accept mechanically the outcome of this test when the observed $y^o$ lies
in the interval (0.9, 1.0).


1.3 CONDITIONING AND UPDATING 


Because a complete model provides a joint density $p(\theta_A, y, \omega \mid A)$, it is in
principle possible to address the entire range of possible marginal and conditional
distributions involving the unobservables, observables, and vector of interest. Let
$y^o$ denote the actual value of the observable (the data, “y observed”). Then with the
data in hand, the relevant probability density for a decision based on the model $A$
is $p(\omega \mid y^o, A)$. This is the single most important principle in Bayesian inference in
support of decisionmaking. The principle, however, subsumes a great many details
taken up in subsequent chapters.

It is useful to break up the process of obtaining $p(\omega \mid y^o, A)$ into a number of
steps, and to introduce some more terminology. The distribution corresponding to
the density $p(\theta_A \mid A)$ is usually known as the prior distribution and that
corresponding to $p(y \mid \theta_A, A)$, as the observables distribution. The distribution of the
unobservable $\theta_A$, conditional on the observed $y^o$, has density


$$p(\theta_A \mid y^o, A) = \frac{p(\theta_A, y^o \mid A)}{p(y^o \mid A)} = \frac{p(\theta_A \mid A)\, p(y^o \mid \theta_A, A)}{p(y^o \mid A)} \propto p(\theta_A \mid A)\, p(y^o \mid \theta_A, A). \tag{1.8}$$


Expression (1.8) is usually called the posterior density of the unobservable $\theta_A$. The
corresponding distribution is the posterior distribution.
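
For a scalar unobservable, (1.8) can be evaluated directly on a grid: multiply prior by
likelihood and divide by the sum, which plays the role of $p(y^o \mid A)$. The sketch below
assumes a toy Bernoulli model with a uniform prior and hypothetical data (7 successes,
3 failures); none of these choices come from the text, and the function name
`grid_posterior` is illustrative.

```python
def grid_posterior(successes, failures, n_grid=1001):
    """Posterior over a grid of theta values via (1.8): prior times likelihood, normalized."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    prior = [1.0 / n_grid] * n_grid                                  # assumed uniform prior
    like = [t ** successes * (1.0 - t) ** failures for t in grid]    # p(y^o | theta, A)
    unnorm = [p * l for p, l in zip(prior, like)]
    marginal = sum(unnorm)                                           # grid analogue of p(y^o | A)
    return grid, [u / marginal for u in unnorm]

grid, post = grid_posterior(successes=7, failures=3)
post_mean = sum(t * w for t, w in zip(grid, post))
```

In this conjugate case the exact posterior is Beta(8, 4), with mean 2/3, so the grid result
can be checked. Grids of this kind become impractical as the dimension of $\theta_A$ grows.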

The distinction between the prior and posterior distributions of $\theta_A$ is not quite
as tidy as this widely used notation and terminology suggests, however. To see
this, define $Y_t = (y_1, \ldots, y_t)$, for $t = 0, \ldots, T$ with the understanding that
$Y_0 = \{\emptyset\}$, and consider the decomposition of the probability density of the
observables $y = Y_T$:


$$p(y \mid \theta_A, A) = \prod_{t=1}^{T} p(y_t \mid Y_{t-1}, \theta_A, A). \tag{1.9}$$


In fact, densities of observables are usually constructed in exactly this way, because 
when there is dependence between observations, a recursive model is typically the 
natural representation. 

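Suppose that $Y_t^o = (y_1^o, \ldots, y_t^o)$ is available but $(y_{t+1}^o, \ldots, y_T^o)$ is not. (If “t”
denotes time, then we are between periods $t$ and $t + 1$.) Then


$$p(\theta_A \mid Y_t^o, A) \propto p(\theta_A \mid A)\, p(Y_t^o \mid \theta_A, A) = p(\theta_A \mid A) \prod_{s=1}^{t} p(y_s^o \mid Y_{s-1}^o, \theta_A, A).$$


When $y_{t+1}^o$ becomes available, then


$$p(\theta_A \mid Y_{t+1}^o, A) \propto p(\theta_A \mid A) \prod_{s=1}^{t+1} p(y_s^o \mid Y_{s-1}^o, \theta_A, A) \propto p(\theta_A \mid Y_t^o, A)\, p(y_{t+1}^o \mid Y_t^o, \theta_A, A). \tag{1.10}$$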


The change in the distribution of $\theta_A$ brought about by the introduction of $y_{t+1}^o$,
made clear in (1.10), is usually known as Bayesian updating. Comparing (1.10)
with (1.8), note that $p(\theta_A \mid Y_t^o, A)$ plays the same role in (1.10) as does the prior
density $p(\theta_A \mid A)$ in (1.8), and that $p(y_{t+1}^o \mid Y_t^o, \theta_A, A)$ plays the same role in
(1.10) as does $p(y^o \mid \theta_A, A)$ in (1.8). Indeed, from the perspective of what happens
at “time” $t + 1$, $p(\theta_A \mid Y_t^o, A)$ is the prior density of $\theta_A$, and $p(\theta_A \mid Y_{t+1}^o, A)$ is
the posterior density of $\theta_A$. This emphasizes the fact that “prior” and “posterior”
distributions (or densities, or moments, or other properties of unobservables) are
always with respect to an incremental information set. In (1.8) this information is
the entire data set $y^o = Y_T^o$, whereas in (1.10) it is $y_{t+1}^o$.
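
The equivalence between updating one observation at a time, as in (1.10), and conditioning
on the whole sample at once, as in (1.8), can be checked numerically. The sketch assumes
conditionally independent Bernoulli observations, so that
$p(y_{t+1} \mid Y_t, \theta_A, A) = p(y_{t+1} \mid \theta_A, A)$, a uniform prior on a grid, and
hypothetical data; the function name `update` is illustrative.

```python
def update(grid, current, y_new):
    """One step of (1.10): the current density times p(y_new | theta), renormalized."""
    like = [t if y_new == 1 else 1.0 - t for t in grid]
    unnorm = [p * l for p, l in zip(current, like)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

grid = [i / 200 for i in range(201)]
flat = [1.0 / len(grid)] * len(grid)        # assumed uniform prior on the grid
data = [1, 0, 1, 1, 0, 1, 1]                # hypothetical observations

# Sequential route: yesterday's posterior is today's prior.
sequential = flat
for y_t in data:
    sequential = update(grid, sequential, y_t)

# Batch route: prior times the likelihood of the whole sample.
s, f = sum(data), len(data) - sum(data)
batch_unnorm = [p * t ** s * (1.0 - t) ** f for p, t in zip(flat, grid)]
batch_total = sum(batch_unnorm)
batch = [u / batch_total for u in batch_unnorm]

max_gap = max(abs(a - b) for a, b in zip(sequential, batch))
```

The two routes agree up to floating-point rounding.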

From the posterior density (1.8), the density relevant for decisionmaking is


$$p(\omega \mid y^o, A) = \int_{\Theta_A} p(\theta_A \mid y^o, A)\, p(\omega \mid \theta_A, y^o, A)\, d\theta_A. \tag{1.11}$$


It is important to acknowledge that we are proceeding in a way that is different
from most non-Bayesian statistics, generally termed “classical” statistics. The key
difference between Bayesian and non-Bayesian statistics is, in fact, in conditioning.
Likelihood-based non-Bayesian statistics conditions on $A$ and $\theta_A$, and compares the
implication $p(y \mid \theta_A, A)$ with $y^o$. This avoids the need for any statement about the
prior density $p(\theta_A \mid A)$, at the cost of conditioning on what is unknown. Bayesian
statistics conditions on $y^o$, and utilizes the full density $p(\theta_A, y, \omega \mid A)$ to build up
coherent tools for decisionmaking, but demands specification of $p(\theta_A \mid A)$.

The strategic advantage of Bayesian statistics stems from the fact that its con- 
ditioning is driven by the actual availability of information and by its complete 
integration with the theory of economic behavior under uncertainty, achieved by 
Friedman and Savage (1948, 1952). We shall return to this point in Section 1.6 and 
subsequently in this book. 

Two additional matters need to be addressed, as well. The first is that (1.8) and
(1.11) are mere formalities as stated; actually representing the densities $p(\theta_A \mid y^o, A)$
and $p(\omega \mid y^o, A)$ in practical ways for decisionmaking is a technical challenge of high
order. Indeed, the principles stated here have been recognized since at least the
mid-1950s, but it was not until the application of simulation methods in the 1980s that
they began to take on the practical significance that they have today. We return to
these developments in Section 1.4 and Chapter 4.

The other matter ignored is explicit attention to multiple models $A_1, \ldots, A_J$. In
fact, it is not necessary to confine attention to a single model, and the developments
here may be extended to several models simultaneously. We do this in Section 1.5.


Exercise 1.3.1 A Simple Posterior Distribution. Suppose that $y \sim N(\mu, 1)$ and
the sample consists of a single observation $y^o$. Suppose that an investigator has a
prior distribution for $\mu$ that is uniform on (0, 4).


(a) Derive the investigator’s posterior distribution for $\mu$.
(b) Suppose that $y^o = -2$. Find an interval $(\mu_1, \mu_2)$ such that


$$P[\mu \in (\mu_1, \mu_2) \mid y^o] = 0.95.$$


(The answer consists of a pair of real numbers.)
(c) Do the same for the case $y^o = 1$.


(d) Are your intervals in (b) and (c) the shortest possible in each case? (You 
need not use a formal argument. A sketch is enough.) 


Exercise 1.3.2 Applied Conditioning and Updating. On a popular, nationally 
televised game show the guest is shown three doors. Behind one door there is a 
valuable prize (e.g., a new luxury automobile), and behind the other two doors there 
are trivial prizes (perhaps a new toaster). The host of the game show knows which 
prizes are behind which doors. The guest, who cannot see the prizes, chooses one 
door for the host to open. But before he opens the door selected by the guest, the 
host always opens one of the two doors not chosen by the guest, and this always 
reveals a trivial prize. (The guest and the television audience, having watched the 
show many times, know that this always happens.) The guest is then given the 
opportunity to change her selected door. After the guest makes her final choice, 
that door is opened and the guest receives the prize behind her chosen door. 

If you were the guest, would you change your door selection when given the 
opportunity to do so? Would you be indifferent about changing your selection? 
Defend your answer with a formal probability argument. 


Exercise 1.3.3 Prior Distributions. Two graduate students play the following 
game. An amount of money W is placed in a sealed envelope. An amount 2W is 
placed in another sealed envelope. Student A is given one envelope, and student 
B is given the other envelope. (The assignment of envelopes is random, and the 
students do not know which envelope they have received.) Before student A opens 
his envelope and keeps the money inside, he may exchange envelopes with student 
B, if B is willing to do this. (At this point, B has not opened her envelope, 
either; the game is symmetric.) In either case, each student keeps the money in the 
envelope finally accepted. Both students are rational and risk-neutral; that is, they 
behave so as to maximize the expected value of the money they keep at the end 
of the game. 

Student A reasons as follows. “There is an unknown amount of money, $x$, in
my envelope. It is just as likely that B’s envelope has $2x$ as it is that it has $x/2$.
Conditional on $x$, my expected gain from switching envelopes is
$.5(2x + .5x) - x = .25x$. Since this is positive for all $x$, I should offer to switch envelopes.”

Student B says that the expected gain from switching envelopes is zero.

Explain the fallacy in A’s argument, and provide the details of B’s argument.
In each case use the laws of probability carefully.


1.4 SIMULATORS 


Decisionmaking requires specific tasks involving posterior distributions. The
financial manager in Section 1.1.2 is concerned about the distribution of values of an
asset 5 days from now, $\omega = p_{T+5} = p_T \exp(\sum_{s=1}^{5} y_{T+s})$. She has at hand
observations on returns through the present time period, $T$, of the form $y^o = (y_1^o, \ldots, y_T^o)$,
and is using a model with a parameter vector $\theta_A$. The value at risk she seeks to
determine is the number $c$ with the property


$$\int_{-\infty}^{p_T - c} p(\omega \mid y^o, A)\, d\omega = 0.05.$$


The manager might recognize that she can decompose this problem into two
parts. First, if she knows the value of $\theta_A$ (or, more precisely, if the model $A$
specifies the value of $\theta_A$ with no uncertainty), then finding $c$ amounts to deriving
the inverse cumulative distribution function (cdf) of $\omega$ from
$p(y_{T+1}, \ldots, y_{T+5} \mid y^o, \theta_A, A)$. This task can be completed analytically for the model (1.3) with known
$\mu$ and $\sigma^2$, but for realistic models with uncertainty about parameters this is at best
tedious and in general impossible.

At this point the financial manager, or one of her staff, might point out that it is
relatively easy to simulate most models of financial time series. One such model is
the Markov mixture of normals model, discussed in more detail in Section 7.3, in
which each $y_t$ is drawn from one of $L$ alternative normal distributions $N(\mu_j, \sigma_j^2)$.
Each day $t$ is characterized by an unobserved state variable $s_t$ that assumes one of
the values $1, 2, \ldots,$ or $L$, and then


$$s_t = j \Rightarrow y_t \sim N(\mu_j, \sigma_j^2). \tag{1.12}$$

The state variables themselves obey a first-order Markov process in which

$$P(s_t = j \mid s_{t-1} = i) = p_{ij}. \tag{1.13}$$


In applications to financial modeling it is reasonable that the values of $\sigma_j^2$ vary
substantially depending on the state, for example, $\sigma_1^2 / \sigma_2^2 > 3$, and the state variable
is persistent as indicated by $p_{ii} \gg \sum_{j \neq i} p_{ij}$. Such a structure gives rise to episodes
of high and low volatility, a feature seen in most financial returns data.

Widely available mathematical applications software makes it easy to simulate
this and many other models. Given the current state $s_t = i$, the next period’s state
is drawn from the distribution (1.13), and then $y_{t+1}$ is drawn from the selected
normal distribution in (1.12). Our firm manager can exploit this fact if she knows
the parameters of the model and the current state $s_T = j$. She repeatedly simulates
the model forward from the current day $T$, obtaining in simulation $m$ the returns
$y_{T+s}^{(m)}$ $(s = 1, \ldots, 5)$ and the corresponding simulated asset price 5 days hence,
$\omega^{(m)} = p_T^o \exp(\sum_{s=1}^{5} y_{T+s}^{(m)})$. At the end she can sort the $M$ simulations of $\omega$, and
find a number $c^{(M)}$ such that 5% of the draws are below and 95% are above
$p_T - c^{(M)}$. It turns out that $c^{(M)} \to c$ as $M$ increases.
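
The forward simulation just described can be sketched in a few lines of Python. Everything
numeric here is an assumption made for illustration: two states (indexed 0 and 1 in the
code), the means, standard deviations, and transition probabilities, and the current price
of 100. The function names are hypothetical.

```python
import math
import random

MU = [0.0005, -0.0010]                 # assumed state means of daily returns
SIGMA = [0.007, 0.020]                 # assumed state standard deviations
P = [[0.98, 0.02], [0.05, 0.95]]       # assumed transition matrix, rows sum to one

def simulate_price(p_T, s_T, horizon, rng):
    """Simulate the asset price `horizon` days ahead, given today's price and state."""
    state, log_return = s_T, 0.0
    for _ in range(horizon):
        state = 0 if rng.random() < P[state][0] else 1       # next state, as in (1.13)
        log_return += rng.gauss(MU[state], SIGMA[state])     # return given state, as in (1.12)
    return p_T * math.exp(log_return)

def value_at_risk(p_T, s_T, M, rng, horizon=5, level=0.05):
    """Sort M simulated prices and return c such that a fraction `level` lie below p_T - c."""
    omegas = sorted(simulate_price(p_T, s_T, horizon, rng) for _ in range(M))
    return p_T - omegas[int(level * M)]

rng = random.Random(1)
c_calm = value_at_risk(100.0, 0, 20000, rng)       # start in the low-volatility state
c_turbulent = value_at_risk(100.0, 1, 20000, rng)  # start in the high-volatility state
```

With persistent states, the simulated value at risk is larger when the current state is the
high-volatility one, which is why the current state matters for the calculation.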

This solves only part of the manager’s problem. The model, in fact, has many
unobservables, not only the unknown parameters $\mu_j$, $\sigma_j^2$, and $p_{ij}$ but also the
states $s_t$. Together they constitute the unobservables vector $\theta_A$ in this model. The
simulation just described requires all of the parameters and the current state $s_T$.


Noting that


$$p(\omega \mid y^o, A) = \int_{\Theta_A} p(\omega \mid y^o, \theta_A, A)\, p(\theta_A \mid y^o, A)\, d\theta_A, \tag{1.14}$$


the manager might well recognize that if she could simulate

$$\theta_A^{(m)} \sim p(\theta_A \mid y^o, A) \tag{1.15}$$

and next apply the algorithm just described to draw

$$\omega^{(m)} \sim p(\omega \mid y^o, \theta_A^{(m)}, A), \tag{1.16}$$


then the distribution of $\omega^{(m)}$ would be that corresponding to the density (1.14).
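
The two-step scheme (1.15)–(1.16) is easiest to see in a case where the posterior is known
in closed form. The sketch assumes a conjugate toy model that is not the manager's model:
$y_t \sim N(\mu, 1)$ i.i.d. with a normal prior for $\mu$, and $\omega$ equal to the next
observation. The data and prior settings are hypothetical.

```python
import math
import random

def posterior_params(y_obs, prior_mean, prior_var):
    """Mean and variance of the conjugate normal posterior for mu when y_t ~ N(mu, 1)."""
    precision = 1.0 / prior_var + len(y_obs)
    mean = (prior_mean / prior_var + sum(y_obs)) / precision
    return mean, 1.0 / precision

def predictive_draws(y_obs, M, rng, prior_mean=0.0, prior_var=100.0):
    m, v = posterior_params(y_obs, prior_mean, prior_var)
    draws = []
    for _ in range(M):
        mu = rng.gauss(m, math.sqrt(v))      # step (1.15): draw the unobservable
        draws.append(rng.gauss(mu, 1.0))     # step (1.16): draw omega given that value
    return draws

rng = random.Random(2)
y_obs = [0.3, -0.1, 0.8, 0.5]                # hypothetical data
omega = predictive_draws(y_obs, 20000, rng)

m, v = posterior_params(y_obs, 0.0, 100.0)
sample_mean = sum(omega) / len(omega)
sample_var = sum((w - sample_mean) ** 2 for w in omega) / len(omega)
```

In this toy model the draws should have mean close to the posterior mean and variance
close to the posterior variance plus one, which reflects both parameter uncertainty and
the noise in the new observation.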

This strategy is valid, but producing the draws in (1.15) is much more challenging
than was developing the algorithm behind (1.16). The latter simulation was
relatively easy because it corresponds to the recursion in the natural expression of
the model; recall (1.4)–(1.6). Given $\theta_A$, the model tells us how $y_1$, then $y_2$, and
so on, are produced, and as a consequence simulating into the future is typically
straightforward. The distribution (1.15), on the other hand, asks us to reverse this
process: given that a set of observables was produced by the model $A$, with prior
distribution $p(\theta_A \mid A)$ and observables distribution $p(y \mid \theta_A, A)$, make drawings
from the distribution with posterior density $p(\theta_A \mid y^o, A)$. The formal definition
(1.8) is not much help in this task.

This impasse is typical if we attempt to use simulation to unravel the actual
distribution corresponding to $p(\omega \mid y^o, A)$ in a useful way. Until the late 1980s
this problem had succumbed to solution in only a few simple cases, and these did
not go very far beyond the even smaller set of cases that could be solved
analytically from start to finish. Geweke (1989a) pointed out that importance sampling
methods described in Hammersley and Handscomb (1964) could be used together
with standard optimization methods to simulate $\theta_A^{(m)} \sim p(\theta_A \mid y^o, A)$. The
following year Gelfand and Smith (1990) published their discovery that methods then
being used in image reconstruction could be adapted to construct a Markov chain
$G$ such that if


$$\theta_A^{(m)} \sim p(\theta_A \mid \theta_A^{(m-1)}, y^o, G)$$


then $\theta_A^{(m)} \xrightarrow{d} p(\theta_A \mid y^o, A)$. This work in short order burgeoned into an even more
general set of procedures, known as Markov chain Monte Carlo (MCMC), which
achieves the same result for almost any complete model. Section 7.3 shows how to
apply these methods to the Markov mixture of normals model used in this example.

All of these methods, including importance sampling, produce what are known 
as posterior simulators. These algorithms make it practical to address quantitative 
decisionmaking problems, using a rich variety of models. Posterior simulators are 
the focus of Chapter 4. 
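
To give a flavor of a posterior simulator, here is a minimal self-normalized importance
sampling sketch. It is not the procedure of Geweke (1989a), which uses optimization to
choose the source density; for simplicity the source density here is just the prior. The
Bernoulli model, uniform prior, and data (7 successes, 3 failures) are assumptions made for
the example.

```python
import random

def importance_posterior_mean(successes, failures, M, rng):
    """Self-normalized importance sampling estimate of E[theta | y^o]."""
    num = den = 0.0
    for _ in range(M):
        theta = rng.random()                                  # draw from the source density (the uniform prior)
        w = theta ** successes * (1.0 - theta) ** failures    # weight: posterior kernel over source density
        num += w * theta
        den += w
    return num / den

rng = random.Random(3)
estimate = importance_posterior_mean(7, 3, 20000, rng)
```

The exact posterior mean in this toy case is 2/3, so the accuracy of the estimate can be
judged directly. When the prior is a poor match for the posterior, most weights are near
zero and the estimate degrades, which is the motivation for choosing the source density
with more care.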


1.5 MODELING 


To this point we have taken the complete model (1.4)–(1.6) as given. In fact, the
investigator begins with much less. Typically the vector of interest $\omega$ is specified (at
least implicitly) by the client making the decision. The composition of the
observables vector is sometimes obvious, but in general the question of which observables
are best used to inform quantitative decisionmaking is itself an important,
interesting, and sometimes difficult question.

This leaves almost all of (1.4)–(1.6) to be specified by the investigator. There
is, of course, no algorithm mapping reality into models. The ability to isolate the
important features of an actual decision problem, and organize them into a model
that is workable and brings to bear all the important features of the decision, is an
acquired and well-rewarded skill. However, this process does involve some
specific technical steps that themselves can be cast as intermediate decision problems
addressed by the investigator.

One such step is to incorporate competing models $A_1, A_2, \ldots, A_J$ in the process
of inference and decisionmaking. In Section 1.2 we constructed a joint probability
distribution for the unobservables $\theta_A$, the observables $y$, and the vector of interest
$\omega$, in the context of model $A$. Suppose that we have done that for each of models
$A_1, \ldots, A_J$ and that the vector of observables is the same for each of these models.
Then we have


$$p(\theta_{A_j} \mid A_j), \quad p(y \mid \theta_{A_j}, A_j), \quad p(\omega \mid \theta_{A_j}, y, A_j) \quad (j = 1, \ldots, J).$$


If we now provide a prior probability $p(A_j)$ for each model, with $\sum_{j=1}^{J} p(A_j) = 1$,
there is a complete probability distribution over models, unobservables,
observables, and the vector of interest. Let $A = \bigcup_{j=1}^{J} A_j$. In each model the
density (1.14), built up from (1.8) and (1.6), provides $p(\omega \mid y^o, A_j)$. Then


$$p(\omega \mid y, A) = \sum_{j=1}^{J} p(\omega \mid y, A_j)\, p(A_j \mid y, A). \tag{1.17}$$


The posterior density of $\omega$ is given by (1.17) with the data $y^o$ replacing the
observable $y$. It is a weighted average of the posterior densities of $\omega$ in the various models;
indeed, (1.17) is sometimes called model averaging. The weights are


$$p(A_j \mid y^o, A) = \frac{p(A_j)\, p(y^o \mid A_j)}{p(y^o \mid A)} = \frac{p(A_j)\, p(y^o \mid A_j)}{\sum_{i=1}^{J} p(A_i)\, p(y^o \mid A_i)}. \tag{1.18}$$


The data therefore affect the weights by means of

$$p(y^o \mid A_j) = \int_{\Theta_{A_j}} p(\theta_{A_j}, y^o \mid A_j)\, d\theta_{A_j} = \int_{\Theta_{A_j}} p(\theta_{A_j} \mid A_j)\, p(y^o \mid \theta_{A_j}, A_j)\, d\theta_{A_j}. \tag{1.19}$$


The number $p(y^o \mid A_j)$ is known as the marginal likelihood of model $A_j$. The
technical obstacles to the computation, or approximation, of $p(y^o \mid A_j)$ are at least
as severe as those for simulating $\theta_A$, but rapid progress on this problem was made
during the 1990s, and this is becoming an increasingly routine procedure.

For any pair of models $(A_i, A_j)$, we obtain


$$\frac{p(A_i \mid y^o)}{p(A_j \mid y^o)} = \frac{p(A_i)}{p(A_j)} \cdot \frac{p(y^o \mid A_i)}{p(y^o \mid A_j)}. \tag{1.20}$$


Note that the ratio is independent of the composition of the full complement of 
models in A. It is therefore a useful summary of the evidence in the data y° about 
the relative posterior probabilities of the two models. The left side of (1.20) is 
known as the posterior odds ratio, and it is decomposed on the right side into the 
product of the prior odds ratio and the Bayes factor. Expressions (1.17) and (1.18) 
imply that providing the marginal likelihood of a model is quite useful for the 
subsequent work, including decisionmaking, with several models. 
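
A small numerical case may help fix ideas about (1.17)–(1.20). The sketch assumes two
hypothetical models for a sequence of Bernoulli trials: $A_1$ fixes the success probability
at 0.5, and $A_2$ gives it a uniform prior. The data (7 successes and 3 failures in a given
order) and the equal prior model probabilities are also assumptions.

```python
import math

def beta_fn(a, b):
    """Beta function B(a, b), computed from gamma functions."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

s, f = 7, 3                               # hypothetical data: successes and failures
marg_lik = [
    0.5 ** (s + f),                       # p(y^o | A_1): theta fixed at 0.5
    beta_fn(s + 1, f + 1),                # p(y^o | A_2): integral of theta^s (1 - theta)^f, as in (1.19)
]
prior_prob = [0.5, 0.5]

evidence = sum(p * m for p, m in zip(prior_prob, marg_lik))              # p(y^o | A)
post_prob = [p * m / evidence for p, m in zip(prior_prob, marg_lik)]     # weights (1.18)
bayes_factor = marg_lik[0] / marg_lik[1]                                 # second factor in (1.20)

# Vector of interest: the probability that the next trial is a success.
pred = [0.5, (s + 1) / (s + f + 2)]       # P(success | y^o, A_j) under each model
averaged = sum(w * q for w, q in zip(post_prob, pred))                   # model averaging (1.17)
```

With equal prior odds the posterior odds equal the Bayes factor, here 1320/1024, so the
data mildly favor $A_1$ and the averaged prediction lies between the two model-specific
predictions.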

Expression (1.19) for the marginal likelihood makes plain that the bearing of a
model on decisionmaking (its weight in the model averaging process (1.17))
depends on the prior density $p(\theta_{A_i} \mid A_i)$ as well as the observables density
$p(y \mid \theta_{A_i}, A_i)$. In particular, a model $A_i$ may be an excellent representation of
the data in the sense that for some value(s) of $\theta_{A_i}$, $p(y^o \mid \theta_{A_i}, A_i)$ is large
relative to the best fit $p(y^o \mid \theta_{A_j}, A_j)$ in other models, but if $p(\theta_{A_i} \mid A_i)$ places low
(even zero) probability on those values, then the posterior odds ratio (1.20) may
run heavily against model $A_i$.

The investigator’s problem in specifying $p(\theta_{A_i} \mid A_i)$ is no more (or less) difficult
than that of designing the observables density $p(y \mid \theta_{A_i}, A_i)$. The two are inseparable:
$p(\theta_{A_i} \mid A_i)$ has no implications for observables without $p(y \mid \theta_{A_i}, A_i)$, and
$p(y \mid \theta_{A_i}, A_i)$ says little about $p(y \mid A_i)$ until we have $p(\theta_{A_i} \mid A_i)$ in hand. The first two
components of any complete model, (1.4) and (1.5), combined with some relatively
simple simulation, can help in these steps of the investigator’s problem. Suppose that
one or more aspects of the observables $y$, which we can represent quite generally as
$g(y)$, are thought to be important aspects of reality bearing on a decision, that therefore
should be well represented by the model. In the case of our financial decisionmaker
from Section 1.1.2, one concern might focus on the model’s stance on “crashes” in
the value of financial assets, like the one-day return of worse than −20% experienced
during October 1987 for many assets; then $g(y) = 1$ if $y$ exhibits such a day and
$g(y) = 0$ if not. For any specified prior and observables densities, it is generally
straightforward to simulate


$$\theta_A^{(m)} \sim p(\theta_A \mid A), \quad y^{(m)} \sim p(y \mid \theta_A^{(m)}, A)$$


and then construct $g(y^{(m)})$. The resulting $g(y^{(m)})$ $(m = 1, \ldots, M)$ is an independent
identically distributed (i.i.d.) sample from $p[g(y) \mid A]$.
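
The crash example can be sketched as follows. Two hypothetical observables densities are
compared under the same assumed prior for the daily scale: i.i.d. normal returns and
i.i.d. scaled Student-t returns with 3 degrees of freedom. The prior range, the sample
length of 250 days, and the function names are all illustrative assumptions.

```python
import math
import random

def draw_sigma(rng):
    return rng.uniform(0.005, 0.03)                   # assumed prior for the daily scale

def normal_return(sigma, rng):
    return rng.gauss(0.0, sigma)

def student_t_return(sigma, rng, dof=3):
    chi2 = rng.gammavariate(dof / 2.0, 2.0)           # chi-square draw with `dof` degrees of freedom
    return sigma * rng.gauss(0.0, 1.0) / math.sqrt(chi2 / dof)

def crash_frequency(draw_return, M, T, rng, threshold=-0.20):
    """Fraction of M prior predictive samples of length T with g(y) = 1."""
    hits = 0
    for _ in range(M):
        sigma = draw_sigma(rng)                       # theta^(m) drawn from the prior
        # g(y^(m)) = 1 if any simulated daily return is below the threshold
        if any(draw_return(sigma, rng) < threshold for _ in range(T)):
            hits += 1
    return hits / M

rng = random.Random(4)
freq_normal = crash_frequency(normal_return, 1000, 250, rng)
freq_t = crash_frequency(student_t_return, 1000, 250, rng)
```

Under these assumptions the normal model essentially never produces a crash, whatever the
prior draw, while the fat-tailed alternative does so in a noticeable share of samples. A
comparison of this kind is available before any data are used.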

This process enables the investigator to understand key properties of a model
$A$ before undertaking the more demanding task of developing a posterior
simulator $\theta_A^{(m)} \sim p(\theta_A \mid y^o, A)$. It provides guidance in choosing the prior density
$p(\theta_A \mid A)$ corresponding to $p(y \mid \theta_A, A)$, and can reveal that an observables density
$p(y \mid \theta_A, A)$ fails to capture important aspects of reality no matter what the value of
$\theta_A$. These tasks are all part of what is generally referred to as “model specification”
in econometrics. We shall return to them in detail in Chapter 8.


1.6 DECISIONMAKING 


The key property of the vector of interest $\omega$ is that it mediates aspects of
reality that are relevant for the decision that motivates the econometric or statistical
modeling in the first place. To illustrate this point, return again to the decision
of school administrators about the class sizes described in Section 1.1.1. School
administrators prefer certain outcomes to others; for example, it is quite likely that
they prefer high test scores and small teaching budgets to low test scores and large
expenditures for teachers’ salaries. Suppose, for the sake of simplicity, that the
teaching budget can be controlled with certainty by hiring more or fewer teachers. In
Bayesian decision theory such a decision is known as an action, and represented
generically by a vector $a$. The vector of interest $\omega$ includes all the uncertain
factors that matter to administrators in evaluating the outcome; it could be a single
summary of test scores, or it might disaggregate to measure test outcomes for
different groups of students. The expected utility paradigm, associated with von
Neumann and Morgenstern (1944), states that decisions are made so as to
maximize the expected value of a utility function $U(a, \omega)$ defined over all possible
outcomes and decisions. The term “utility” is universal in economics, whereas in
Bayesian decision theory the concept of “loss” prevails; the loss function $L(a, \omega)$
is used in place of the utility function $U(a, \omega)$. The only distinction is that the
decisionmaker seeks to minimize, not maximize, $E[L(a, \omega)]$. We can always take
$L(a, \omega) = -U(a, \omega)$.

This paradigm fits naturally into the relationship between the model $A$ with
parameter vector $\theta_A$, the observable vector $y$, and the vector of interest $\omega$.
Expression (1.11) provides the distribution relevant to the decision in the use of a single
model $A$, that is, the distribution relevant for the expectation $E[L(a, \omega)]$, which
therefore may be written


$$E[L(a, \omega) \mid y^o, A] = \int_{\Omega} L(a, \omega)\, p(\omega \mid y^o, A)\, d\omega = \int_{\Omega} \int_{\Theta_A} L(a, \omega)\, p(\theta_A \mid y^o, A)\, p(\omega \mid \theta_A, y^o, A)\, d\theta_A\, d\omega.$$


Section 1.4 outlines how, in principle, we might obtain drawings $\omega^{(m)}$ from
(1.11). Typically those drawings can be used to solve the formal decision problem.
In the simplest case, there are only two possible actions ($a = 0$, $a = 1$), and the
drawings $\omega^{(m)}$ are i.i.d. Then, so long as $E[L(0, \omega)]$ and $E[L(1, \omega)]$ both exist (a
requirement for the expected utility paradigm to be applicable), the strong law of
large numbers implies


$$M^{-1} \sum_{m=1}^{M} L(a, \omega^{(m)}) \xrightarrow{a.s.} E[L(a, \omega) \mid y^o, A]$$


for $a = 0$ and $a = 1$. More generally, if $a$ is continuous and $E[L(a, \omega)]$ is twice
differentiable, then typically


$$M^{-1} \sum_{m=1}^{M} \partial L(a, \omega^{(m)}) / \partial a \xrightarrow{a.s.} \partial E[L(a, \omega) \mid y^o, A] / \partial a$$


and this feature may be exploited to solve for the value $a = \hat{a}$ that minimizes
expected loss, using a steepest-descent algorithm. More often, the draws $\omega^{(m)}$ from
(1.11) are serially dependent, but this complication turns out not to be essential.
We revisit these issues at the level of technical detail required for their application
subsequently in Chapter 4. This formalization of the decisionmaking process can
be extended to the case of several competing models, using the setup developed in
Sections 2.6 and 8.2.
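
The steepest-descent idea can be sketched for a scalar action with quadratic loss, where
the answer is known: the minimizer of the simulated expected loss is the mean of the
draws. The loss function, the step size, and the normal draws that stand in for output
from a posterior simulator are all assumptions made for the example.

```python
import random

def mean_loss_gradient(a, omegas):
    """Average of dL/da over the draws, for the quadratic loss L(a, omega) = (a - omega)^2."""
    return sum(2.0 * (a - w) for w in omegas) / len(omegas)

def minimize_expected_loss(omegas, a0=0.0, step=0.25, iters=200):
    a = a0
    for _ in range(iters):
        a -= step * mean_loss_gradient(a, omegas)     # one steepest-descent step
    return a

rng = random.Random(5)
omegas = [rng.gauss(1.0, 1.0) for _ in range(5000)]   # stand-in for draws from p(omega | y^o, A)
a_hat = minimize_expected_loss(omegas)
omega_bar = sum(omegas) / len(omegas)
```

Here `a_hat` coincides with `omega_bar` up to rounding. For other differentiable loss
functions only `mean_loss_gradient` would change, although a fixed step size is not
guaranteed to converge in general.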

Decisionmaking plays, or should play, an important role in modeling and
inference. It focuses attention, first, on the vector of interest $\omega$ that is relevant to the
decision problem, namely, the unobservables that will ultimately drive the
subjective evaluation of the decision ex post. Given $\omega$, we may then consider the
observables $y$ that are most likely to be useful in providing information about
$\omega$ before the decision is made. The observables then govern consideration of the
relevant models $A$, their vectors of unobservables $\theta_A$, and the associated prior
densities $p(\theta_A \mid A)$. Note that this amounts to stepping backward through the
marginal-conditional decomposition (1.7), a process that is often informal.

In practice, formal decisionmaking is most useful for the structure that it places 
on the research endeavor from start to finish. Rarely, if ever, do decisionmakers
think and talk about decisions entirely and explicitly within the formal
framework we have laid out here. However, the discipline of formal decision theory
combined with Bayesian inference can, when well executed, earn the respect of 
real decisionmakers, and therefore a “seat at the table.” In many ways, this is the 
ultimate goal of Bayesian inference, and achieving it is a high reward to applied 
econometrics and statistics. 


CHAPTER 2 


Elements of Bayesian Inference 


This chapter systematically develops the principles of Bayesian inference that are 
used repeatedly in the rest of the book. The purpose is threefold: to set up nota- 
tion, to provide an introduction for statisticians and econometricians unfamiliar with 
Bayesian methods, and to set forth some technical challenges addressed in subse- 
quent chapters. The development emphasizes the eventual application of Bayesian 
inference in decisionmaking contexts. 

The introduction here is concise, concentrating on analytic essentials and touch- 
ing lightly on some concepts of greater depth. Those versed in Bayesian methods 
at the level of Berger (1985) or Bernardo and Smith (1994) can easily skip to the 
fourth chapter and beyond, consulting Section 2.1 as required for notation. Those 
seeking a complete introduction can consult these references as well as the next 
chapter, perhaps supplemented by DeGroot (1970), Berger and Wolpert (1988), 
and Poirier (1988) on the distinction between Bayesian and non-Bayesian meth- 
ods. On Bayesian econometrics in particular, see Zellner (1971), Poirier (1995), 
Koop (2003), and Lancaster (2004). All the concepts introduced in this chapter are 
illustrated using the normal linear regression model. 

The results presented in this chapter are not operational. In particular, they 
all involve integrals that rarely can be evaluated analytically, and the dimensions 
of integration are typically greater than the four or five for which deterministic 
numerical methods are practical. Chapter 4 provides the analytical development 
of posterior simulators, which are then used in the practical procedures developed 
subsequently. 


2.1 BASICS 


Bayesian inference takes place in the context of one or more statistical or
econometric models. A model describes the behavior of a $p \times 1$ vector of observable
random vectors $y_t$. The index $t$ has one of a number of interpretations determined
by the application including time, individuals in a random sample, location, as
well as combinations of these and other relevant attributes of the observables. Let
$Y_t = \{y_s\}_{s=1}^{t}$ denote the subsample consisting of the first $t$ observables. The sample
space for $y_t$ is $\psi_t$, that for $Y_t$ is $\Psi_t$, and $\psi_0 = \Psi_0 = \{\emptyset\}$. A model, $A$, specifies a
corresponding sequence of probability density functions


$$p(y_t \mid Y_{t-1}, \theta_A, A), \tag{2.1}$$


in which $\theta_A$ is a $k_A \times 1$ vector of unobservables and $\theta_A \in \Theta_A \subseteq \mathbb{R}^{k_A}$. The
interpretation of $\theta_A$ depends on the model $A$. In many contexts $\theta_A$ is a vector of a fixed
number of unknown parameters. However, $\theta_A$ may also include unobservables that
economists and other social scientists call “latent variables,” and in that case $k_A$
typically depends on the size of the sample. We return to this interpretation in more
detail in Section 3.1.

The notation $p(\cdot)$ will be used to denote a generic probability density function
(pdf) with respect to a generic measure $\nu(\cdot)$. The measure $\nu$ permits the random
vector to be continuous (Lebesgue measure), discrete (point mass) or a mixture
of the two. Thus, for example, if $\nu(\cdot)$ assigns ordinary (Lebesgue) measure to the
unit interval and the measure one to the point $x = 0.7$, and $p(x) = (1/2) I_{(0,1)}(x)$,
then $P(A) = \int_A p(x)\, d\nu(x)$ is the probability function corresponding to a random
variable that is uniformly distributed on the unit interval with probability $1/2$, and
takes on the value 0.7 with probability $1/2$.
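
The mixed example can be made concrete for intervals. The sketch below evaluates
$P([lo, hi])$ by adding the contribution of the Lebesgue part of $\nu$ to the contribution
of its point mass at 0.7; the function name `prob_interval` is hypothetical.

```python
def prob_interval(lo, hi):
    """P([lo, hi]) as the integral of p(x) = 1/2 with respect to nu over [lo, hi]."""
    overlap = max(0.0, min(hi, 1.0) - max(lo, 0.0))   # Lebesgue measure of [lo, hi] within the unit interval
    atom = 1.0 if lo <= 0.7 <= hi else 0.0            # nu places measure one on the point x = 0.7
    return 0.5 * overlap + 0.5 * atom

p_all = prob_interval(0.0, 1.0)     # whole unit interval
p_left = prob_interval(0.0, 0.5)    # excludes the atom
p_mid = prob_interval(0.6, 0.8)     # includes the atom
```

The interval from 0.6 to 0.8 has probability 0.1 from the continuous part plus 0.5 from
the atom, which shows how a single density with respect to $\nu$ covers both components.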

The pdf of $Y_T$, conditional on the model and the vector of unobservables $\theta_A$, is


$$p(Y_T \mid \theta_A, A) = \prod_{t=1}^{T} p(y_t \mid Y_{t-1}, \theta_A, A). \tag{2.2}$$


If the model specifies that the $y_t$ are independent and identically distributed, then


$$p(y_t \mid Y_{t-1}, \theta_A, A) = p(y_t \mid \theta_A, A)$$

and in this case


$$p(Y_T \mid \theta_A, A) = \prod_{t=1}^{T} p(y_t \mid \theta_A, A).$$


More generally, the index t may pertain to cross sections, to time series, or both. 
Time series models and language preserve this generality. 

When used alone, expressions like $y_t$ and $Y_T$ denote random vectors. In
equations (2.1) and (2.2) $y_t$ and $Y_T$ are arguments of functions. These uses are
distinct from the observed values themselves. To preserve this distinction
explicitly, denote observed $y_t$ by $y_t^o$ and observed $Y_T$ by $Y_T^o$. In general, the superscript
$o$ will denote the observed value of a random vector. For example, if the observed
value of the random vector $Y_T$ is $Y_T^o$, then the likelihood function is any function
$L(\theta_A; Y_T^o, A) \propto p(Y_T^o \mid \theta_A, A)$. Unless the number of observations, $T$, is important
to the topic at hand, we shall simply denote the observables by $y$, their observed
values (the data) by $y^o$, and the sample space by $\Psi$.

The pdf (2.2) is the first component of the model $A$ introduced in Section 1.2.
The second component is the prior density $p(\theta_A \mid A)$. The prior density is a
formal representation of the values of the vector of unobservables $\theta_A$ that are
reasonable in the model $A$. It reflects everything that is known, or believed,
about $\theta_A$ prior to learning the observed values $y^o$. For example, if $\Theta_A$ has finite
Lebesgue measure, then the prior density could assign probability uniformly in $\Theta_A$:
$p(\theta_A) = [\nu(\Theta_A)]^{-1}$. The notation extends to the extreme case in which the model
assigns exact values $\theta_A^*$ to all unobservables, $\theta_A = \theta_A^*$. In that case $\nu$ places point
mass at $\theta_A^*$ and $p(\theta_A^*) = 1$. In general, the functional forms of the observables
densities $p(y_t \mid Y_{t-1}, \theta_A, A)$ $(t = 1, \ldots, T)$ and the prior density $p(\theta_A \mid A)$ are
chosen simultaneously as part of the model $A$ as discussed in Section 1.2.

Given (2.2) and the prior density, we obtain


$$p(y, \theta_A \mid A) = p(\theta_A \mid A)\, p(y \mid \theta_A, A). \tag{2.3}$$


Thus model $A$ provides a joint density of the observables, $y$, and unobservables,
$\theta_A$. Expression (2.3) decomposes this density as a marginal density in $\theta_A$ (the
prior) and a density in $y$ conditional on $\theta_A$ (the data density). The joint density can
also be expressed as the product of the marginal density in $y$ and the conditional
density in $\theta_A$:

$$p(y, \theta_A \mid A) = p(y \mid A)\, p(\theta_A \mid y, A). \tag{2.4}$$


Both terms on the right side of (2.4) may be written in terms of the prior density 
and observables density. The marginal density in y is 


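$$p(y \mid A) = \int_{\Theta_A} p(y, \theta_A \mid A)\, d\nu(\theta_A) = \int_{\Theta_A} p(\theta_A \mid A)\, p(y \mid \theta_A, A)\, d\nu(\theta_A). \tag{2.5}$$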


Note that the integral appearing in this expression is absolutely convergent for
almost all $y$, because $p(y, \theta_A \mid A)$ is the joint density of $y$ and $\theta_A$. Expression
(2.5) is the density of the observable $y$ implied by the model a priori. It is a
prediction of what the data will be, before they are observed. It makes explicit the
predictive content of the model $A$, and indicates the instrumental role of the vector
of unobservables $\theta_A$ in expressing this prediction. This expression also emphasizes
that since the observables density and prior density are complementary in forming
the predictions of the model, they should be chosen together. If $y$ is replaced with
$y^o$ in (2.5), then $p(y^o \mid A)$ is a real number.
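
As a numerical illustration of (2.5), the sketch below approximates the integral for a
deliberately simple model that is not taken from the text: $T$ Bernoulli trials with
success probability $\theta$ and a uniform prior on $(0, 1)$. The function names, the data
values, and the use of a midpoint rule are assumptions made for the illustration. For
this particular model the integral also has a closed form, which serves as a check.

```python
import math

def data_density(theta, successes, trials):
    """p(y^o | theta, A) for one specific sequence of Bernoulli outcomes."""
    return theta ** successes * (1.0 - theta) ** (trials - successes)

def marginal_likelihood(successes, trials, n_grid=20000):
    """Midpoint-rule approximation of the integral in (2.5), uniform prior on (0, 1)."""
    width = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        theta = (i + 0.5) * width
        prior = 1.0  # uniform prior density (an assumption of this sketch)
        total += prior * data_density(theta, successes, trials) * width
    return total

def exact_marginal_likelihood(successes, trials):
    """Closed form for this model: the beta function B(s + 1, T - s + 1)."""
    return (math.factorial(successes) * math.factorial(trials - successes)
            / math.factorial(trials + 1))

# Hypothetical data: 7 successes in 10 trials.
approx = marginal_likelihood(successes=7, trials=10)
exact = exact_marginal_likelihood(successes=7, trials=10)
print(approx, exact)
```

Because the data density here keeps every constant, the result approximates the value
of $p(y^o \mid A)$ itself rather than a quantity merely proportional to it.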


Definition 2.1.1 The marginal likelihood of the model $A$ is $p(y^o \mid A)$.


This terminology [which dates at least to Raiffa and Schlaifer (1961), Section
2.1] reflects the fact that with $y^o$ in place of $y$, (2.5) can be interpreted as
“marginalizing” the vector of unobservables in the likelihood function, that is,
$p(y^o \mid A) = \int_{\Theta_A} L(\theta_A; y^o, A)\, d\nu(\theta_A)$. But note that this is true only if $L(\theta_A; y^o, A)$
carries forward all constants of integration in $p(y^o, \theta_A \mid A)$, that is, $L(\theta_A; y^o, A) =
p(y^o, \theta_A \mid A)$ and not simply $L(\theta_A; y^o, A) \propto p(y^o, \theta_A \mid A)$.

The second term on the right side of (2.4), the conditional density, is

$$p(\theta_A \mid y, A) = \frac{p(\theta_A \mid A)\, p(y \mid \theta_A, A)}{p(y \mid A)}. \tag{2.6}$$


If $y$ is replaced with $y^o$ in (2.6), then this expression provides the distribution of the
vector of unobservables $\theta_A$ conditional on the data $y^o$ in the context of the model
$A$. Since $p(\theta_A \mid A)$ conveys what is known about $\theta_A$ before (“a priori”) learning
$y^o$, $p(\theta_A \mid y^o, A)$ communicates what is known after (“a posteriori”) learning $y^o$.


Definition 2.1.2 The posterior density of the vector of unobservables $\theta_A$ in
the model $A$ is

$$p(\theta_A \mid y^o, A) = \frac{p(\theta_A \mid A)\, p(y^o \mid \theta_A, A)}{p(y^o \mid A)}. \tag{2.7}$$

The expression in the denominator of (2.7) is the marginal likelihood. In many
circumstances it suffices to know just the shape of the posterior density
$p(\theta_A \mid y^o, A)$ and it is costly to evaluate $p(y^o \mid A)$. In this case it is useful to
exploit the fact that

$$p(\theta_A \mid y^o, A) \propto p(\theta_A \mid A)\, p(y^o \mid \theta_A, A). \tag{2.8}$$


Definition 2.1.3 Any nonnegative function k(x) proportional to a probability 
density function p(x) is a kernel of p(x). 


The expression on the right side of (2.8) is a kernel of the posterior density. In
general, any nonnegative function with a finite, positive integral is the kernel of some
probability density function. To emphasize the distinction between the posterior density
function proper and a kernel of that function, we shall sometimes refer to (2.7) as
the normalized posterior density, whereas the right side of (2.8) is the posterior
density kernel in standard form. Any function $k(\theta_A \mid y^o, A) \propto p(\theta_A \mid y^o, A)$ is a
kernel of the posterior density.
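
The step from a kernel to a normalized density can be sketched numerically. The example
below is an illustration under stated assumptions, not a model from the text: a scalar
mean $\mu$ with a normal prior, normally distributed data with known precision $h$, and a
simple grid for the normalization. All names and data values are hypothetical.

```python
import math

def posterior_kernel(mu, data, h, prior_mean, prior_precision):
    """Prior kernel times data-density kernel; all constants are dropped."""
    prior = math.exp(-0.5 * prior_precision * (mu - prior_mean) ** 2)
    likelihood = math.exp(-0.5 * h * sum((y - mu) ** 2 for y in data))
    return prior * likelihood

def normalize_on_grid(kernel, lower, upper, n_grid=4000):
    """Divide the kernel by its (midpoint-rule) integral to obtain a density on a grid."""
    width = (upper - lower) / n_grid
    points = [lower + (i + 0.5) * width for i in range(n_grid)]
    values = [kernel(p) for p in points]
    constant = sum(values) * width          # plays the role of the denominator in (2.7)
    return points, [v / constant for v in values], width

data = [1.2, 0.7, 1.9, 1.4, 0.9]            # hypothetical observations
h, prior_mean, prior_precision = 2.0, 0.0, 1.0
points, density, width = normalize_on_grid(
    lambda mu: posterior_kernel(mu, data, h, prior_mean, prior_precision), -5.0, 5.0)

grid_mean = sum(p * d for p, d in zip(points, density)) * width
# Known closed form for this normal-normal setting, used only as a check.
exact_mean = (prior_precision * prior_mean + h * sum(data)) / (prior_precision + h * len(data))
print(grid_mean, exact_mean)
```

The kernel never needed its constants: the grid normalization supplies them, which is
why working with (2.8) is sufficient whenever only the shape of the posterior matters.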

The third and final component of the model $A$ is a vector of interest $\omega \in \Omega \subseteq \mathbb{R}^q$
representing entities the model is intended to describe, together with a conditional
density $p(\omega \mid y, \theta_A, A)$. Whereas $\theta_A$ is specific to the model $A$, $\omega$ remains the
same across models. This includes a wide range of possibilities.


Example 2.1.1 Vector of Interest in the Value at Risk Example If we are interested
only in the portfolio value 5 days hence, we can take $\omega = p_T \exp(\sum_{s=1}^{5} y_{T+s})$.
A model of asset returns, $A$, provides $p(y_t \mid Y_{t-1}, \theta_A, A)$, and thereby
$p(\omega \mid Y_T, \theta_A, A)$. The simple, but unrealistic, model (1.3) implies
$\log(\omega / p_T) \mid (\mu, \sigma^2) \sim N(5\mu, 5\sigma^2)$ conditional on $\theta_A = (\mu, \sigma^2)'$. As we shall see
(Examples 2.1.2 and 2.3.3), for certain prior distributions we may remove the conditioning
on $\mu$ and $\sigma^2$ and obtain similar compact expressions for the distribution of $\omega$. In general,
however, this will not be possible, and in particular it cannot be done in the more
realistic model (1.12)-(1.13).

Definition 2.1.4 A complete model $A$ consists of three components: the
observables density $p(y \mid \theta_A, A)$, the prior density $p(\theta_A \mid A)$, and the vector of
interest density, $p(\omega \mid y, \theta_A, A)$.


The objective of inference can generally be expressed as the posterior density
of the vector of interest $\omega$. When there is just one model, $A$, this is

$$p(\omega \mid y^o, A) = \int_{\Theta_A} p(\omega \mid y^o, \theta_A, A)\, p(\theta_A \mid y^o, A)\, d\nu(\theta_A). \tag{2.9}$$

By means of (2.8) and (2.9), $p(\omega \mid y^o, A)$ is expressed in terms of the three
components of the model $A$. Quite often, but by no means always, the objective of
inference can be expressed as $E[h(\omega) \mid y^o, A]$ for suitably chosen $h(\cdot)$. This
formulation includes several special cases of interest.

If a hypothesis restricts $\theta_A$ to a set $\Theta_{A0} \subseteq \Theta_A$, then, by taking $h(\omega) = I_{\Theta_{A0}}(\theta_A)$,
we have $E[h(\omega) \mid y^o, A] = P(\theta_A \in \Theta_{A0} \mid y^o, A)$, the posterior probability that the
hypothesis is true in the context of model $A$. For example, suppose that in the
class size example (Section 1.1.1) the investigator uses a normal linear model in
which class size is the second component of $x_t$, and wishes to ascertain whether
increasing class size lowers test scores. Then $\Theta_{A0}$ might consist of all those sets
of parameters $(\beta, h)$ for which $\beta_2 < 0$.

Another important class of cases arises from prediction problems, $\omega' =
(y_{T+1}, \ldots, y_{T+q})$. The appropriate choice of $h(\omega)$ may include expected values,
turning point probabilities, and predictive intervals. In the value at risk example
(Section 1.1.2) let $\omega_s = p_T \exp(\sum_{r=1}^{s} y_{T+r})$ $(s = 1, \ldots, 5)$ denote the portfolio
values over the next 5 days. The maximum value over the period corresponds
to $h(\omega) = \sup_{s=1,\ldots,5} \omega_s$. To assess the probability of a turning point in portfolio
value on day $T + 3$, we would set $h(\omega) = 1$ if $\omega_2 < \omega_3 > \omega_4$ or $\omega_2 > \omega_3 < \omega_4$
and $h(\omega) = 0$ otherwise. To assess the probability that $p_T - p_{T+5} > c$ define
$h(\omega; c) = 1$ if $\omega_5 < p_T - c$ and $h(\omega; c) = 0$ otherwise. The value at risk problem
is to find that $c$ for which $E[h(\omega; c) \mid Y_T^o, A] = .05$, a relatively easy task if
we can compute $E[h(\omega; c) \mid Y_T^o, A]$ for any value of $c$.
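
The value at risk calculation can be sketched by simulation. The code below is a
simplified illustration, not the full posterior computation: it holds $\theta_A = (\mu, \sigma^2)'$
fixed at hypothetical values and draws i.i.d. normal returns, so it estimates
$E[h(\omega; c) \mid \theta_A]$ rather than $E[h(\omega; c) \mid Y_T^o, A]$. A complete treatment would also
average over draws of $\theta_A$ from its posterior. Function names and numerical values are
assumptions.

```python
import math
import random
from statistics import NormalDist

def simulate_terminal_values(p_T, mu, sigma, horizon=5, n_draws=100_000, seed=0):
    """Draw omega_5 = p_T * exp(y_{T+1} + ... + y_{T+5}) with i.i.d. N(mu, sigma^2) returns."""
    rng = random.Random(seed)
    return [p_T * math.exp(sum(rng.gauss(mu, sigma) for _ in range(horizon)))
            for _ in range(n_draws)]

def value_at_risk(p_T, terminal_values, level=0.05):
    """Return c such that the simulated share of draws with omega_5 < p_T - c is about `level`."""
    losses = sorted((p_T - w for w in terminal_values), reverse=True)
    return losses[int(level * len(losses))]

p_T, mu, sigma = 100.0, 0.0005, 0.01        # hypothetical portfolio value and return parameters
draws = simulate_terminal_values(p_T, mu, sigma)
c_simulated = value_at_risk(p_T, draws)

# Closed form implied by log(omega_5 / p_T) ~ N(5 mu, 5 sigma^2), used as a check.
z = NormalDist().inv_cdf(0.05)
c_exact = p_T * (1.0 - math.exp(5 * mu + math.sqrt(5) * sigma * z))

# Monte Carlo estimate of E[h(omega; c)] at the simulated c.
share = sum(1 for w in draws if w < p_T - c_simulated) / len(draws)
print(c_simulated, c_exact, share)
```

The same draws can be reused for any other $h(\omega)$, such as the turning point indicator,
by storing the whole path $(\omega_1, \ldots, \omega_5)$ instead of only $\omega_5$.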

Yet another useful class of functions arises whenever a decisionmaker must take
one of two actions, $a_1$ or $a_2$. Then, $h(\omega) = L(a_1, \omega) - L(a_2, \omega)$, in which $L(a, \omega)$
denotes the loss incurred if action $a$ is taken and then the realization of the vector of
interest is œ. In the drug approval example at the beginning of Chapter 1, the FDA 
must decide whether to approve a drug; œ might be a vector of health outcomes. 
In the merger example, the regulatory authority must either approve or disapprove 
the proposed merger; œ might be a vector of prices of products produced by the 
firms involved. 


Example 2.1.2 Normal Linear Regression Model For an observable $T \times 1$ vector
of dependent variables $y$ and $T \times k$ matrix of fixed covariates $X$, assume

$$y \mid (\beta, h, X, A) \sim N(X\beta, h^{-1} I_T); \quad \operatorname{rank}(X) = k. \tag{2.10}$$

The fixed and observed covariates $X$ are part of the model specification $A$, but it
will prove convenient to include them explicitly in the notation for the conditional
distribution of observables (2.10). We shall sometimes write $X' = [x_1, \ldots, x_T]$ and
let $\varepsilon$ denote the $T \times 1$ vector of disturbances, with $\varepsilon_t = y_t - \beta' x_t$ $(t = 1, \ldots, T)$.
The parameter $h$ is the precision of each of the i.i.d. disturbances $\varepsilon_t$; it is the
inverse of $\operatorname{var}(\varepsilon_t) = \sigma^2$. (More generally, the precision of any random variable is
the inverse of its variance.) The vector of unobservables is the parameter vector
$\theta_A' = (\beta', h)$. The coefficient vector $\beta$ and the precision $h$ are independent in the
prior. The prior distribution of $\beta$ is


$$\beta \mid A \sim N(\underline{\beta}, \underline{H}^{-1}). \tag{2.11}$$


In (2.11) the mean $\underline{\beta}$ is a $k \times 1$ vector of constants. This vector is specified as
part of the prior distribution. The precision $\underline{H}$ is a $k \times k$ positive definite matrix of
constants, also specified as part of the prior distribution. (In general, an underscore
will denote constants in prior distributions.) The prior distribution of $h$ is

$$\underline{s}^2 h \mid A \sim \chi^2(\underline{\nu}). \tag{2.12}$$


The formulation (2.12) is a concise way of expressing a gamma distribution for h. 
In general, the two-parameter gamma distribution can be represented in this form 
[see Johnson et al. (1994), Section 17.3]. In a given application, (2.11)-(2.12) is not 
necessarily an adequate representation of prior information and beliefs, and other 
prior distributions could be used. However, (2.11)—(2.12) has attractive analytical 
properties that will become clear in due course, and Section 8.4 discusses methods 
for modifying this and other prior distributions. 

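In (2.11) $\underline{\beta}$, $\underline{H}$, and $\underline{H}^{-1}$ are respectively the prior mean, prior precision, and
prior variance of $\beta$. Thus

$$p(\beta \mid A) = (2\pi)^{-k/2}\, |\underline{H}|^{1/2} \exp[-(\beta - \underline{\beta})' \underline{H} (\beta - \underline{\beta})/2]. \tag{2.13}$$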


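To derive the probability density corresponding to (2.12) recall that if $w \sim \chi^2(\nu)$
then

$$p(w) = [2^{\nu/2}\, \Gamma(\nu/2)]^{-1} w^{(\nu-2)/2} \exp(-w/2), \tag{2.14}$$

$E(w) = \nu$, and $\operatorname{var}(w) = 2\nu$. Hence the prior mean and variance of $h$ are
$E(h \mid A) = \underline{\nu}/\underline{s}^2$ and $\operatorname{var}(h \mid A) = 2\underline{\nu}/\underline{s}^4$, respectively. Through the usual change of
variable, the pdf of $h$ is

$$p(h \mid A) = [2^{\underline{\nu}/2}\, \Gamma(\underline{\nu}/2)]^{-1} (\underline{s}^2)^{\underline{\nu}/2}\, h^{(\underline{\nu}-2)/2} \exp(-\underline{s}^2 h/2). \tag{2.15}$$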
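
The prior moments of $h$ stated above can be checked by simulation. This is a minimal
sketch; the prior constants are hypothetical, and it relies on the standard fact that a
$\chi^2(\nu)$ variate is a gamma variate with shape $\nu/2$ and scale 2.

```python
import random

def draw_precision(s2, nu, rng):
    """Draw h so that s2 * h ~ chi-square(nu), as in (2.12)."""
    # random.gammavariate(alpha, beta) takes a shape alpha and a scale beta.
    return rng.gammavariate(nu / 2.0, 2.0) / s2

s2, nu = 4.0, 6.0                  # hypothetical prior constants (s^2 and nu)
rng = random.Random(0)
draws = [draw_precision(s2, nu, rng) for _ in range(200_000)]

mean_h = sum(draws) / len(draws)
var_h = sum((h - mean_h) ** 2 for h in draws) / len(draws)
print(mean_h, nu / s2)             # simulated and exact prior mean of h
print(var_h, 2 * nu / s2 ** 2)     # simulated and exact prior variance of h
```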


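Another change of variable yields the prior pdf of $\sigma^2 = h^{-1}$:

$$p(\sigma^2 \mid A) = [2^{\underline{\nu}/2}\, \Gamma(\underline{\nu}/2)]^{-1} (\underline{s}^2)^{\underline{\nu}/2} (\sigma^2)^{-(\underline{\nu}+2)/2} \exp(-\underline{s}^2/2\sigma^2). \tag{2.16}$$

From the specification (2.10), we obtain

$$p(y \mid \beta, h, X, A) = (2\pi)^{-T/2}\, h^{T/2} \exp[-h(y - X\beta)'(y - X\beta)/2]. \tag{2.17}$$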


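The posterior density kernel in standard form is the product of (2.13), (2.15), and
(2.17) evaluated at $y^o$:

$$(2\pi)^{-(T+k)/2}\, [2^{\underline{\nu}/2}\, \Gamma(\underline{\nu}/2)]^{-1} \tag{2.18a}$$
$$\cdot\; |\underline{H}|^{1/2}\, (\underline{s}^2)^{\underline{\nu}/2} \tag{2.18b}$$
$$\cdot\; h^{(T+\underline{\nu}-2)/2} \exp(-\underline{s}^2 h/2) \tag{2.18c}$$
$$\cdot\; \exp\{-[(\beta - \underline{\beta})' \underline{H} (\beta - \underline{\beta}) + h(y^o - X\beta)'(y^o - X\beta)]/2\}. \tag{2.18d}$$

To interpret this expression, it is useful to begin with some algebra. Complete the
square in $\beta$ of the term in brackets in (2.18d) to obtain

$$(\beta - \underline{\beta})' \underline{H} (\beta - \underline{\beta}) + h(y^o - X\beta)'(y^o - X\beta) = (\beta - \bar{\beta})' \bar{H} (\beta - \bar{\beta}) + Q,$$

where

$$\bar{H} = \underline{H} + hX'X, \tag{2.19}$$
$$\bar{\beta} = \bar{H}^{-1}(\underline{H}\,\underline{\beta} + hX'y^o) = \bar{H}^{-1}(\underline{H}\,\underline{\beta} + hX'Xb), \tag{2.20}$$
$$Q = h\,y^{o\prime} y^o + \underline{\beta}' \underline{H}\,\underline{\beta} - \bar{\beta}' \bar{H} \bar{\beta}, \tag{2.21}$$

where $b$ denotes the coefficients in the ordinary least-squares fit of $y^o$ to $X$,
$b = (X'X)^{-1}X'y^o$.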

If (2.18a)-(2.18d) is interpreted as a function of $\beta$ only, then (2.18d) is a
posterior density kernel for $\beta$ conditional on $h$, and our square completion shows that

$$p(\beta \mid h, y^o, X, A) \propto \exp[-(\beta - \bar{\beta})' \bar{H} (\beta - \bar{\beta})/2]. \tag{2.22}$$

Consequently

$$\beta \mid (h, y^o, X, A) \sim N(\bar{\beta}, \bar{H}^{-1}). \tag{2.23}$$
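
The update in (2.19) and (2.20) can be sketched for $k = 2$ covariates using only the
standard library. The function name, the explicit $2 \times 2$ solve, and the data are
assumptions chosen to keep the example self-contained; with larger $k$ a linear algebra
library would be the natural choice.

```python
def posterior_beta_given_h(X, y, h, prior_mean, prior_precision):
    """Return (beta_bar, H_bar) from (2.19)-(2.20) for k = 2 covariates."""
    # Accumulate X'X and X'y row by row.
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
    xty = [sum(row[i] * yt for row, yt in zip(X, y)) for i in range(2)]
    # (2.19): posterior precision = prior precision + h X'X.
    H_bar = [[prior_precision[i][j] + h * xtx[i][j] for j in range(2)] for i in range(2)]
    # Right-hand side of (2.20): prior precision times prior mean, plus h X'y.
    rhs = [sum(prior_precision[i][j] * prior_mean[j] for j in range(2)) + h * xty[i]
           for i in range(2)]
    # Solve H_bar * beta_bar = rhs by Cramer's rule.
    det = H_bar[0][0] * H_bar[1][1] - H_bar[0][1] * H_bar[1][0]
    beta_bar = [(H_bar[1][1] * rhs[0] - H_bar[0][1] * rhs[1]) / det,
                (H_bar[0][0] * rhs[1] - H_bar[1][0] * rhs[0]) / det]
    return beta_bar, H_bar

# Hypothetical data lying exactly on y = 1 + 2x, so the least-squares fit is b = (1, 2).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
prior_mean = [0.0, 0.0]

diffuse = [[1e-9, 0.0], [0.0, 1e-9]]      # prior precision near zero
tight = [[1e9, 0.0], [0.0, 1e9]]          # prior precision very large
beta_diffuse, _ = posterior_beta_given_h(X, y, 1.0, prior_mean, diffuse)
beta_tight, _ = posterior_beta_given_h(X, y, 1.0, prior_mean, tight)
print(beta_diffuse)   # close to the least-squares coefficients b
print(beta_tight)     # close to the prior mean
```

The two calls illustrate the limiting behavior of the matrix weighted average: a nearly
flat prior returns the least-squares coefficients, and a very tight prior returns the
prior mean.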


The conditional posterior distribution of $\beta$ is normal because the prior distribution
of $\beta$ is normal and the likelihood function in $\beta$, (2.18d), is a kernel of a multivariate
normal distribution. Note the symmetry of the prior and observables distributions
as they are combined in (2.19), (2.20), and (2.23). The precision of the posterior
distribution is the sum of the prior precision $\underline{H}$ and the term $hX'X$. The latter is the
posterior precision in the limit as the prior precision $\underline{H} \to 0$, and might therefore
be called the observables precision matrix. The mean of the posterior distribution
is the matrix weighted average of the prior mean $\underline{\beta}$ and the vector $b$. The latter is
the limiting posterior mean as $\underline{H} \to 0$. The respective matrix weights are the prior
precision matrix $\underline{H}$ and the observables precision matrix $hX'X$.

Interpreting the product of (2.18a)-(2.18d) as a function of $h$ alone, we see
that a kernel of $h \mid (\beta, y^o, X, A)$ is the product of (2.18c) and (2.18d). Hence the
posterior density of $h$ conditional on $\beta$ and the data is

$$p(h \mid \beta, y^o, X, A) \propto h^{(T+\underline{\nu}-2)/2} \exp\{-[\underline{s}^2 + (y^o - X\beta)'(y^o - X\beta)]h/2\}. \tag{2.24}$$
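Comparing this expression with (2.12) and (2.15), it is evident that

$$\bar{s}^2 h \mid (\beta, y^o, X, A) \sim \chi^2(\bar{\nu}) \tag{2.25}$$

where

$$\bar{s}^2 = \underline{s}^2 + (y^o - X\beta)'(y^o - X\beta) \quad \text{and} \quad \bar{\nu} = T + \underline{\nu}. \tag{2.26}$$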


There is again an evident symmetry between the prior and the observables, and it 
again arises because the kernel of the prior density for h, (2.15), and the kernel of 
the observables density function in h, (2.17) have the same functional form. 

The prior distributions (2.11) and (2.12) are attractive because they lead to the 
simple and interpretable results (2.23) and (2.25). We shall study this property more 
generally and systematically in the consideration of conjugate and conditionally 
conjugate priors, in Section 2.3. The results (2.23) and (2.25) are not immedi- 
ately useful, for they do not provide distributions conditional only on the data and 
prior information. We cannot obtain these distributions analytically, although this is 
possible using different priors. Alternatively, a numerical approach may be taken. 
We will pursue the former strategy in Section 2.3, and the latter method will be 
developed in Section 4.3. 
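
To give a sense of what a numerical approach can look like, the sketch below alternates
draws from the two conditional distributions (2.23) and (2.25), a scheme usually called
a Gibbs sampler. It is offered as one possible illustration and is not presented as the
specific method of Section 4.3. To stay within the standard library it assumes a single
covariate ($k = 1$, no intercept); the data, prior constants, and function name are
hypothetical.

```python
import math
import random

def gibbs_sampler(x, y, prior_mean, prior_precision, s2_prior, nu_prior,
                  n_draws=2000, burn_in=500, seed=0):
    """Alternate draws of beta | h and h | beta for a one-covariate normal linear model."""
    rng = random.Random(seed)
    T = len(y)
    sum_xx = sum(xt * xt for xt in x)                  # X'X when k = 1
    sum_xy = sum(xt * yt for xt, yt in zip(x, y))      # X'y when k = 1
    h = 1.0                                            # arbitrary starting value
    draws = []
    for m in range(n_draws + burn_in):
        # beta | h from (2.23), with H_bar and beta_bar as in (2.19)-(2.20).
        H_bar = prior_precision + h * sum_xx
        beta_bar = (prior_precision * prior_mean + h * sum_xy) / H_bar
        beta = rng.gauss(beta_bar, 1.0 / math.sqrt(H_bar))
        # h | beta from (2.25)-(2.26): s2_bar * h ~ chi-square(nu_bar).
        s2_bar = s2_prior + sum((yt - beta * xt) ** 2 for xt, yt in zip(x, y))
        nu_bar = nu_prior + T
        h = rng.gammavariate(nu_bar / 2.0, 2.0) / s2_bar
        if m >= burn_in:
            draws.append((beta, h))
    return draws

# Simulated data with beta = 2 and precision h = 4 (standard deviation 0.5).
data_rng = random.Random(1)
x = [t / 100.0 for t in range(1, 201)]
y = [2.0 * xt + data_rng.gauss(0.0, 0.5) for xt in x]

draws = gibbs_sampler(x, y, prior_mean=0.0, prior_precision=0.01,
                      s2_prior=1.0, nu_prior=1.0)
beta_mean = sum(b for b, _ in draws) / len(draws)
h_mean = sum(h for _, h in draws) / len(draws)
print(beta_mean, h_mean)
```

Each step uses only the closed-form conditionals derived above, which is what makes the
prior (2.11)-(2.12) convenient even though the joint posterior has no closed form.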


Example 2.1.3 Geometric Interpretation of the Normal Linear Regression
Model with Two Covariates There is an informative geometric interpretation
of the posterior mean $\bar{\beta}$, conditional on $h$, due to Leamer (1973). A geometric
representation of the level contours of the prior pdf of $\beta$ (2.13) consists of the
ellipses

$$\{\beta : (\beta - \underline{\beta})' \underline{H} (\beta - \underline{\beta}) = c_1\}$$

for various positive constants $c_1$. Because (2.13) implies $(\beta - \underline{\beta})' \underline{H} (\beta - \underline{\beta}) \mid A \sim
\chi^2(k)$, the prior probability that $\beta$ is in the interior of the ellipse is $1 - \alpha$ if
$c_1$ is the $1 - \alpha$ quantile of the $\chi^2(k)$ distribution. Interpreting the pdf of $y$ given in
(2.17) as a density kernel for $\beta$ and substituting $y^o$ for $y$, the level contours of that
density are the ellipses $\{\beta : (\beta - b)' hX'X (\beta - b) = c_2\}$.

Now consider the set of points $\beta$ such that there is no point $\beta^* \in \mathbb{R}^k$ for which
both $p(\beta^* \mid A) > p(\beta \mid A)$ and $p(y^o \mid \beta^*, h, X, A) > p(y^o \mid \beta, h, X, A)$. Through
the usual constrained optimization calculus, this is the set of points that solves the
first-order condition for the objective function

$$(\beta - b)' hX'X (\beta - b) + \lambda (\beta - \underline{\beta})' \underline{H} (\beta - \underline{\beta}). \tag{2.27}$$

Figure 2.1. Normal linear model with two regressors, geometric interpretation: (a) prior and
likelihood contours with locus of posterior means; (b) prior, likelihood, and posterior contours.

This entire set is indexed by the Lagrange multiplier $\lambda$. Setting the first derivative
of (2.27) with respect to $\beta$ to zero, this curve may be expressed

$$\beta(\lambda; h) = [hX'X + \lambda \underline{H}]^{-1} (hX'Xb + \lambda \underline{H}\,\underline{\beta}).$$

Observe that $\beta(1; h) = \bar{\beta}$.

An example for the case $k = 2$ is presented in Figure 2.1. The level contours of
the prior distribution are represented by the lighter ellipses; those for the likelihood,
by the darker ellipses. The curve $\beta(1; h)$ is the solid curve shown in Figure 2.1a.
This locus of tangencies between the ellipses is indexed by $h$ and traces the conditional
posterior mean $\bar{\beta}$ as a function of $h$. In particular, $\lim_{h \to 0} \beta(1; h) = \underline{\beta}$, and
$\lim_{h \to \infty} \beta(1; h) = b$. In the example portrayed, $\beta_1(1; h) < b_1$ and $\beta_2(1; h) < b_2$
for most values of $h$. This illustrates some subtleties of the matrix weighted average
of $\underline{\beta}$ and $b$ in (2.20). The darkest ellipses in Figure 2.1b indicate the level
contours of the posterior distribution of $\beta$ for the case $h = 1$. For more on the
question of the sensitivity of the posterior mean to the prior mean, in this setting,
see Leamer (1982).


Exercise 2.1.1 Combining Information “Suppose that a random sample of size
$T$ is drawn from a normal population with unknown mean $\mu$ and known precision
$h$. There is always a normal prior distribution for $\mu$ that will lead to a normal
posterior distribution for $\mu$ with mean $\bar{m}$ and precision $\bar{h}$, for any real $\bar{m}$ and
positive $\bar{h}$.”

Is this statement true or false? If true, prove it. If false, provide a counterexample, 
a correct version of the statement, and a proof of the correct version. 


Exercise 2.1.2 Probability Density Kernels Let $p(y)$ denote the probability density
of an $n \times 1$ random vector $y$ and suppose that

$$\log p(y) = y'Ay + b'y + c.$$

The $n \times n$ matrix $A$ is negative definite.

(a) Show that $y$ has a multivariate normal distribution. Express its precision $H$
in terms of $A$ and its mean $\mu$ in terms of $A$ and $b$.

(b) Express $c$ in terms of $H$ and $\mu$.


Exercise 2.1.3 The Generalized Normal Linear Regression Model Suppose
that in model $A$ of Example 2.1.2, $y \mid (\beta, X, A) \sim N(X\beta, V)$, where $V$ is a known
$T \times T$ positive definite matrix. Show that given the prior distribution (2.11) the
posterior distribution of $\beta \mid (y^o, X, A)$ is then $\beta \sim N(\bar{\beta}, \bar{H}^{-1})$, and provide expressions
for $\bar{H}$ and $\bar{\beta}$.


Exercise 2.1.4 The Short Rank Normal Linear Model Suppose that in the nor- 
mal linear model of Example 2.1.2, rank(X) < k. All other aspects of the model 
remain the same as in Example 2.1.2. 


(a) Show that $\beta \mid (h = 1, y^o, X, A) \sim N(\bar{\beta}, \bar{H}^{-1})$, and provide expressions for
$\bar{H}$ and $\bar{\beta}$. [Of course, the least-squares estimate $b = (X'X)^{-1}X'y^o$ does not
exist in this case.]

(b) Show that if $k = 6$, then

$$P[(\beta - \bar{\beta})' \bar{H} (\beta - \bar{\beta}) < 12.5922 \mid (h = 1, y^o, X, A)] = .95.$$

(c) Suppose that for the $k \times 1$ vector $a$, it is the case that $Xa = 0$. Suppose also
that $\underline{H} = I_k$. Show that the prior distribution and the posterior distribution
of $a'\beta$ are the same. For $k = 2$, draw a sketch in $\beta_1$ and $\beta_2$ space similar
to Figure 2.1a to illustrate what is going on. [Hint: $(\underline{H} + X'X)^{-1} = \underline{H}^{-1} -
\underline{H}^{-1}X'(X\underline{H}^{-1}X' + I_T)^{-1}X\underline{H}^{-1}$.]

(d) Rework (c) for the case of any nonsingular prior variance matrix $\underline{H}^{-1}$. For
some linear combinations $a'\beta$, the prior and posterior distribution are the
same. What can you say about the vectors $a$ for which this is true? How
does the sketch you drew in (c) change for this more general case?


Exercise 2.1.5 Distribution of $\sigma^2$ and Related Parameters This exercise is
about the prior distribution of $\sigma^2$ in (2.16), its properties, and related distributions.

(a) Derive (2.16) from (2.15).

(b) Suppose that $\sigma^2$ has the pdf (2.16). Derive the mean and variance of $\sigma^2$,
indicating any additional assumptions that are necessary for these moments
to exist. [Hints:

(i) This is not an elaborate integration problem. What does (2.16) tell you
about the integral of the kernel of the pdf of $\sigma^2$?
(ii) $\Gamma(x + 1) = x\Gamma(x)$ for all $x > 0$.]

(c) If $\sigma^2$ has the pdf (2.16), what is the pdf of $\sigma$? Express the median of
the distribution of $\sigma$ in terms of the median $\chi^2_{.50}(\underline{\nu})$ of the chi-squared
distribution with $\underline{\nu}$ degrees of freedom.


2.2 SUFFICIENCY, ANCILLARITY, AND NUISANCE PARAMETERS


The steps that are undertaken to derive the posterior distribution $p(\theta_A \mid y^o, A)$ or
the marginal likelihood $p(y^o \mid A)$ depend on the relations between $y^o$ and $\theta_A$ in
these expressions. In particular circumstances these expressions can be simplified.
Two of the most useful arise when the data can be reduced to a smaller set of
statistics (called sufficient statistics) for the purpose of inference, and again when a
subset of this set (called ancillary statistics) can be regarded as fixed for the same
purpose.


2.2.1 Sufficiency 


Definition 2.2.1 The vector $s = s(y; A)$ is a sufficient statistic in a model with
the observables density $p(y \mid \theta_A, A)$ if

$$p[y \mid s(y; A), \theta_A, A] = p[y \mid s(y; A), A] \quad \forall\, \theta_A \in \Theta_A. \tag{2.28}$$

Heuristically, (2.28) implies that there is no information about $y$ originating in
$\theta_A$ in the density $p(y \mid \theta_A, A)$, beyond that conveyed by $s$ ex ante. This suggests
that in learning about $\theta_A$, nothing would be lost by confining attention to $s(y; A)$
rather than $y$. This is indeed the case.


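Theorem 2.2.1 Ex Post and Ex Ante Equivalence of Sufficiency The vector
$s(y; A)$ is a sufficient statistic in a model with observables density $p(y \mid \theta_A, A)$ if
and only if for all $\theta_A \in \Theta_A$ and for all $y \in \Psi$, we have

$$p(\theta_A \mid y, A) = p[\theta_A \mid s(y; A), A]. \tag{2.29}$$

Proof: Suppose that (2.28) is true. Then

$$p(\theta_A \mid y, A) = p(\theta_A \mid y, s, A) = \frac{p(y \mid \theta_A, s, A)\, p(\theta_A \mid s, A)}{p(y \mid s, A)} = \frac{p(y \mid s, A)\, p(\theta_A \mid s, A)}{p(y \mid s, A)} = p(\theta_A \mid s, A).$$

Conversely, if (2.29) is true, then

$$p(y \mid s, \theta_A, A) = \frac{p(\theta_A \mid y, s, A)\, p(y \mid s, A)}{p(\theta_A \mid s, A)} = \frac{p(\theta_A \mid y, A)\, p(y \mid s, A)}{p(\theta_A \mid s, A)} = p(y \mid s, A).$$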


Note that the conditions in Theorem 2.2.1 hold for any choice of the prior density
$p(\theta_A \mid A)$ and vector of interest $\omega$. This is because sufficiency is a property of the
observables density $p(y \mid \theta_A, A)$ alone. In demonstrating the sufficiency of $s(y; A)$
in an observables density, it is usually easiest to use a third, equivalent condition.


Theorem 2.2.2 Factorization Criterion The vector $s(y; A)$ is a sufficient
statistic in a model with data density $p(y \mid \theta_A, A)$ if and only if

$$p(y \mid \theta_A, A) = p[s(y; A) \mid \theta_A, A]\, r(y; A) \tag{2.30}$$

for some function $r(y; A)$.


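Proof: It suffices to show that (2.29) and (2.30) are equivalent. First suppose
that (2.30) is true. Then

$$p(\theta_A \mid y, A) = \frac{p(y \mid \theta_A, A)\, p(\theta_A \mid A)}{p(y \mid A)} = \frac{p(s \mid \theta_A, A)\, r(y; A)\, p(\theta_A \mid A)}{p(y \mid A)} \propto p(\theta_A \mid s, A).$$

Both sides are densities in $\theta_A$, so the proportionality is an equality, which is (2.29).
On the other hand, given (2.29), it follows that

$$p(y \mid \theta_A, A) = \frac{p(\theta_A \mid y, A)\, p(y \mid A)}{p(\theta_A \mid A)} = p(\theta_A \mid s, A)\, \frac{p(y \mid A)}{p(\theta_A \mid A)}$$
$$= \frac{p(s \mid \theta_A, A)\, p(\theta_A \mid A)}{p(s \mid A)} \cdot \frac{p(y \mid A)}{p(\theta_A \mid A)} = p(s \mid \theta_A, A)\, \frac{p(y \mid A)}{p(s \mid A)},$$

where in the last equation $r(y; A) = p(y \mid A)/p(s \mid A)$.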


The factorization criterion is particularly useful in demonstrating that $s(y; A)$ is
a sufficient statistic. This is because it is often relatively easy to demonstrate that
for the likelihood function

$$L(\theta_A; y, A) = L[\theta_A; s(y; A), A]. \tag{2.31}$$

It is always the case that

$$L(\theta_A; y, A) = p(y \mid \theta_A, A)\, r_1(y),$$

where $r_1(y)$ absorbs any constants excluded from the likelihood function, and

$$L[\theta_A; s(y; A), A] = p[s(y; A) \mid \theta_A, A]\, r_2(y),$$

where $r_2(y)$ absorbs any such constants, including the Jacobian of transformation
between $y$ and $s(y; A)$. Hence, if (2.31) is true for all $\theta_A \in \Theta_A$ and $y \in \Psi$, then
$s(y; A)$ is a sufficient statistic.


Example 2.2.1 Sufficient Statistics in the Normal Linear Regression Model
From (2.17), we have

$$p(y \mid \beta, h, X, A) \propto h^{T/2} \exp[-h(y - X\beta)'(y - X\beta)/2].$$

Completing the square, we obtain

$$(y - X\beta)'(y - X\beta) = (y - Xb)'(y - Xb) + (\beta - b)'X'X(\beta - b) = s^2 + (\beta - b)'X'X(\beta - b),$$

where $b = (X'X)^{-1}X'y$, and $s^2 = (y - Xb)'(y - Xb)$. By the factorization criterion,
$[b, s^2, X'X, T]$ is a sufficient statistic in the normal linear regression model
$A$. This is equivalent to $[Z'Z, T]$, where $Z = [X, y]$.
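
The reduction to sufficient statistics can be checked numerically. The sketch below
assumes a single covariate ($k = 1$) so that $X'X$ is a scalar; the function names and data
are hypothetical. It evaluates the log of the likelihood kernel once from the full data
and once from $[b, s^2, X'X, T]$ alone.

```python
import math

def sufficient_statistics(x, y):
    """Return (b, s2, X'X, T) for a one-covariate regression without intercept."""
    sum_xx = sum(xt * xt for xt in x)
    b = sum(xt * yt for xt, yt in zip(x, y)) / sum_xx      # least-squares coefficient
    s2 = sum((yt - b * xt) ** 2 for xt, yt in zip(x, y))   # sum of squared residuals
    return b, s2, sum_xx, len(y)

def log_kernel_from_data(beta, h, x, y):
    """Log of h^(T/2) exp[-h (y - X beta)'(y - X beta) / 2], using every observation."""
    ssr = sum((yt - beta * xt) ** 2 for xt, yt in zip(x, y))
    return 0.5 * len(y) * math.log(h) - 0.5 * h * ssr

def log_kernel_from_statistics(beta, h, stats):
    """The same quantity computed from the sufficient statistics only."""
    b, s2, sum_xx, T = stats
    return 0.5 * T * math.log(h) - 0.5 * h * (s2 + (beta - b) ** 2 * sum_xx)

x = [1.0, 2.0, 3.0, 4.0]          # hypothetical covariate values
y = [1.1, 1.9, 3.2, 3.9]          # hypothetical observations
stats = sufficient_statistics(x, y)
gaps = [abs(log_kernel_from_data(beta, h, x, y) - log_kernel_from_statistics(beta, h, stats))
        for beta in (-1.0, 0.5, 1.0, 2.5) for h in (0.2, 1.0, 7.0)]
print(max(gaps))                  # zero up to floating-point rounding
```

Once the four statistics are stored, the original observations are no longer needed to
evaluate the likelihood at any $(\beta, h)$.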


2.2.2 Ancillarity 


Definition 2.2.2 Suppose that $s(y; A)$ is a sufficient statistic in the observables
density $p(y \mid \theta_A, A)$. If there exist partitions $s' = (s_1', s_2')$ and $\theta_A' = (\theta_{A1}', \theta_{A2}')$
such that

$$p(\theta_A \mid A) = p(\theta_{A1} \mid A)\, p(\theta_{A2} \mid A), \tag{2.32}$$
$$p(s_1 \mid \theta_A, A) = p(s_1 \mid \theta_{A1}, A), \tag{2.33}$$
$$p(s_2 \mid s_1, \theta_A, A) = p(s_2 \mid s_1, \theta_{A2}, A), \tag{2.34}$$

then $s_1$ is an ancillary statistic with respect to $\theta_{A2}$.
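Ancillarity implies

$$p(\theta_A \mid y, A) \propto p(\theta_A \mid s, A)$$
$$\propto p(\theta_{A1} \mid A)\, p(\theta_{A2} \mid A)\, p(s_1 \mid \theta_A, A)\, p(s_2 \mid s_1, \theta_A, A)$$
$$= p(\theta_{A1} \mid A)\, p(s_1 \mid \theta_{A1}, A)\, p(\theta_{A2} \mid A)\, p(s_2 \mid s_1, \theta_{A2}, A). \tag{2.35}$$

It simplifies inference when the vector of interest $\omega$ depends on $\theta_{A2}$ and $y$, but not
$\theta_{A1}$, or, alternatively, when $\omega$ depends on $\theta_{A1}$ and $y$, but not $\theta_{A2}$. In the first case
$p(\omega \mid y, \theta_A, A) = p(\omega \mid y, \theta_{A2}, A)$ and (2.35) implies

$$p(\theta_{A2} \mid y, A) \propto p(\theta_{A2} \mid A)\, p(s_2 \mid s_1, \theta_{A2}, A).$$

Then

$$p(\omega \mid y, A) = \int_{\Theta_A} p(\omega \mid y, \theta_A, A)\, p(\theta_A \mid y, A)\, d\nu(\theta_A)$$
$$\propto \int_{\Theta_{A2}} p(\omega \mid y, \theta_{A2}, A)\, p(\theta_{A2} \mid A)\, p(s_2 \mid s_1, \theta_{A2}, A)\, d\nu(\theta_{A2}).$$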


This means that for purposes of learning about $\omega$ from the data $y^o$ it is necessary
only to use the prior density of $\theta_{A2}$, $p(\theta_{A2} \mid A)$, and the conditional density of
$s_2$, $p(s_2 \mid s_1, \theta_{A2}, A)$. Since $p(\omega \mid y, A)$ is not affected by the prior distribution
of $\theta_{A1}$ or the marginal density $p(s_1 \mid \theta_{A1}, A)$, it is not necessary to develop these
distributions beyond establishing the properties (2.32)-(2.34). The random vector
$s_1(y_T; A)$ can be treated as fixed, and the parameter vector $\theta_{A1}$ can be ignored.


Example 2.2.2 Ancillarity in the Normal Linear Regression Model Modify
the assumptions made in Example 2.1.2 by making $X$ random, with pdf
$p(X \mid \eta, A)$, with $\eta$ ranging over its parameter space. If $p(\beta, h, \eta \mid A) = p(\beta, h \mid A)\, p(\eta \mid A)$,
then $X$ is ancillary with respect to $(\beta, h)$. If the distribution of the vector of interest
depends only on $\beta$ and $h$, the matrix $X$ can be treated as fixed and the parameter
vector $\eta$ ignored. That is exactly what was done in Example 2.1.2. Therefore that
treatment of the normal linear regression model with $X$ fixed is also appropriate when
$X$ is random but ancillary with respect to $(\beta, h)$, and the distribution of the vector of
interest depends only on $\beta$ and $h$. This happens often in applied work.


If $p(\omega \mid y, \theta_A, A) = p(\omega \mid y, \theta_{A1}, A)$, the factorization (2.35) is also useful,
because then

$$p(\omega \mid y, A) \propto \int_{\Theta_{A1}} p(\omega \mid y, \theta_{A1}, A)\, p(\theta_{A1} \mid A)\, p(s_1 \mid \theta_{A1}, A)\, d\nu(\theta_{A1}).$$

Since $p(\omega \mid y, A)$ is not affected by the prior distribution of $\theta_{A2}$ or the conditional
density $p(s_2 \mid s_1, \theta_{A2}, A)$, it is not necessary to develop these distributions beyond
establishing the properties (2.32)-(2.34). The random vector $s_2$ can simply be
ignored.


Example 2.2.3 Missing Data It is sometimes the case that not all of the observables
$y$ are, in fact, observed. For example, in a survey some respondents may
not reply to some questions; time series data may be quarterly before a certain
date and monthly thereafter. Let $y' = (y_o', y_m')$, where $y_o$ denotes the observables
subsequently observed and $y_m$ denotes those that are subsequently missing. Let the
inclusion indicator $I$ be isomorphic to $y$ with $I_t = 1$ if $y_t \in y_o$ and $I_t = 0$ if $y_t \in y_m$.
Without further assumptions a complete model must specify $p(y, I \mid \theta_A, A)$. But
suppose further that $\theta_A' = (\theta_{A1}', \theta_{A2}')$ and

$$p(y, I \mid \theta_{A1}, \theta_{A2}, A) = p(y \mid \theta_{A1}, A)\, p(I \mid y, \theta_{A2}, A) \tag{2.36}$$

and $p(\theta_A \mid A) = p(\theta_{A1} \mid A)\, p(\theta_{A2} \mid A)$. If, in addition

$$p(I \mid y_o, y_m, \theta_{A2}, A) = p(I \mid y_o, \theta_{A2}, A) \tag{2.37}$$

then the missing observables $y_m$ are said to be missing at random. From (2.36)

$$p(y_o, I \mid \theta_A, A) = \int p(y_o, y_m \mid \theta_{A1}, A)\, p(I \mid y_o, y_m, \theta_{A2}, A)\, d\nu(y_m)$$
$$= p(I \mid y_o, \theta_{A2}, A) \int p(y_o, y_m \mid \theta_{A1}, A)\, d\nu(y_m). \tag{2.38}$$

[If $p(I \mid y, \theta_{A2}, A) = p(I \mid \theta_{A2}, A)$, then the missing observables $y_m$ are said to
be missing completely at random, which of course implies (2.37). For further
discussion, see Gelman et al. (1995) or Little and Rubin (2002).] Comparison of
(2.38) with (2.33) and (2.34) shows that $y_o$ is an ancillary statistic with respect to
$\theta_{A2}$. If the vector of interest $\omega$ does not depend on $\theta_{A2}$, as is generally the case,
then the investigator need be concerned only with $p(\theta_{A1} \mid A)$ and
$\int p(y_o, y_m \mid \theta_{A1}, A)\, d\nu(y_m)$. The last expression is cumbersome, in principle, but can often be
managed easily using simulation methods; see Example 5.2.1 and Exercises 5.3.3,
6.4.3, and 7.1.1.


2.2.3 Nuisance Parameters 


If $\theta_A' = (\theta_{A1}', \theta_{A2}')$ and $p(\omega \mid y, \theta_A, A) = p(\omega \mid y, \theta_{A2}, A)$ but there is no
ancillary statistic with respect to $\theta_{A2}$, then $\theta_{A1}$ is a vector of nuisance parameters.
In non-Bayesian econometrics nuisance parameters can be troublesome, because
test and related statistics pertaining to $E[h(\omega) \mid y, \theta_A, A]$ depend on the value
of the unknown parameter vector $\theta_{A2}$. Nuisance parameters never present any
fundamental difficulty in Bayesian inference, because they are marginalized in the
posterior distribution of $\omega$:


$$p(\omega \mid y^o, A) = \int_{\Theta_A} p(\omega \mid y^o, \theta_A, A)\, p(\theta_A \mid y^o, A)\, d\nu(\theta_A)$$
$$= \int_{\Theta_{A2}} p(\omega \mid y^o, \theta_{A2}, A)\, p(\theta_{A2} \mid y^o, A) \left[ \int_{\Theta_{A1}} p(\theta_{A1} \mid \theta_{A2}, y^o, A)\, d\nu(\theta_{A1}) \right] d\nu(\theta_{A2})$$
$$= \int_{\Theta_{A2}} p(\omega \mid y^o, \theta_{A2}, A)\, p(\theta_{A2} \mid y^o, A)\, d\nu(\theta_{A2}).$$

In a posterior simulator, if $\theta_A^{(m)} \sim p(\theta_A \mid y^o, A)$ then $\theta_{A1}^{(m)}$ can be ignored and
$\omega^{(m)} \sim p(\omega \mid y^o, \theta_{A2}^{(m)}, A)$.

Example 2.2.4 Precision as a Nuisance Parameter in the Normal Linear
Regression Model It is often the case that $p(\omega \mid \beta, h, A) = p(\omega \mid \beta, A)$ in this
model. In the context of Example 2.1.2 the precision $h$ is then a nuisance parameter.
At a formal level, we may therefore work directly with the posterior distribution
$\beta \mid (y^o, X, A)$ instead of $(\beta, h) \mid (y^o, X, A)$ by integrating $h$ from the joint
distribution whose kernel is (2.18c)-(2.18d). At a practical level this is challenging
since there is no closed-form solution except in a few limiting cases like the one
presented in Example 3.2.1. Numerical procedures provide a ready solution to the
practical problem for the situation of Example 2.1.2; see Example 4.3.1.
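
In a simulator, marginalizing a nuisance parameter amounts to discarding its draws. The
sketch below illustrates this with a hypothetical joint density for $(\beta, h)$ that can be
sampled directly: $s^2 h \sim \chi^2(\nu)$ and $\beta \mid h \sim N(m, (\kappa h)^{-1})$. This joint density is chosen only
because its marginal variance for $\beta$ is known in closed form; it is not the posterior
of Example 2.1.2, and all names and constants are assumptions.

```python
import math
import random

def draw_joint(m, kappa, s2, nu, rng):
    """One draw of (beta, h): s2 * h ~ chi-square(nu), then beta | h ~ N(m, 1 / (kappa * h))."""
    h = rng.gammavariate(nu / 2.0, 2.0) / s2
    beta = rng.gauss(m, 1.0 / math.sqrt(kappa * h))
    return beta, h

rng = random.Random(0)
m, kappa, s2, nu = 0.5, 4.0, 10.0, 10.0          # hypothetical constants
joint_draws = [draw_joint(m, kappa, s2, nu, rng) for _ in range(100_000)]

# Marginalize the nuisance parameter h simply by ignoring it.
beta_draws = [beta for beta, _h in joint_draws]

mean_beta = sum(beta_draws) / len(beta_draws)
var_beta = sum((b - mean_beta) ** 2 for b in beta_draws) / len(beta_draws)
exact_var = s2 / (kappa * (nu - 2.0))            # marginal variance of beta, valid for nu > 2
prob_positive = sum(1 for b in beta_draws if b > 0.0) / len(beta_draws)
print(mean_beta, var_beta, exact_var, prob_positive)
```

Any function of $\beta$ alone, such as the probability that $\beta$ is positive, is estimated from
`beta_draws` without ever integrating over $h$ analytically.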


Exercise 2.2.1 Models for Positive Observables The observable $y_t$ is strictly
positive. In model $A$, the distribution of $y_t$ is exponential, with density

$$p(y_t \mid \theta, A) = \theta \exp(-\theta y_t)\, I_{(0,\infty)}(y_t).$$

In model $B$, the distribution of $y_t$ is half-normal:

$$y_t \mid (h, B) \sim HN(0, h^{-1}), \quad p(y_t \mid h, B) = (\pi/2)^{-1/2}\, h^{1/2} \exp(-h y_t^2/2)\, I_{(0,\infty)}(y_t).$$

The observables $y_1, \ldots, y_T$ are independently and identically distributed. Indicate
a vector of sufficient statistics in model $A$, and one in model $B$.
(For continuation, see Exercise 2.3.3.) 


Exercise 2.2.2 A Complete Uniform Distribution Model Suppose that
$y_1, \ldots, y_T$ are independently distributed, each with a uniform distribution on the
interval $[0, \theta]$.

(a) What is $p(y_1, \ldots, y_T \mid \theta, A)$?

(b) Find a $2 \times 1$ sufficient statistic vector for $\theta$.

(c) Find the maximum likelihood estimator of $\theta$. What important regularity
condition underlying the conventional asymptotic distribution theory of maximum
likelihood estimators is violated in this case?

(d) Suppose that the model is completed with the prior density

$$p(\theta \mid A) = \lambda \exp(-\lambda\theta)\, I_{(0,\infty)}(\theta),$$

where $\lambda$ is a specified positive constant. Find a kernel of the posterior density
for $\theta$.

(e) Suppose that the model is completed with the prior density $p(\theta \mid A) =
c^{-1} I_{(0,c)}(\theta)$, where $c$ is a specified positive constant. Find the posterior density
(not a kernel) and the moments $E(\theta \mid y^o, A)$ and $\operatorname{var}(\theta \mid y^o, A)$.


(For continuation, see Exercise 2.3.2.) 




Exercise 2.2.3 Sufficiency and Ancillarity for the Uniform Distribution Suppose
that $(y_t, t = 1, \ldots, T)$ are independently distributed, each with a uniform
distribution on the interval $[\theta, \theta + 1]$. Define

$$y_{\min} = \min_{t=1,\ldots,T}(y_t); \quad y_{\max} = \max_{t=1,\ldots,T}(y_t); \quad y^* = (y_{\min} + y_{\max})/2; \quad r = y_{\max} - y_{\min}.$$

(a) Show that $(y^*, r)$ is a sufficient statistic.

(b) Show that $y^*$ is not a sufficient statistic.

(c) Show that the distribution of $y_t$ is location-invariant: $p(y_t \mid \theta + a) = p(y_t - a \mid \theta)$.

(d) Show that the distribution of $r$ does not depend on $\theta$.

(e) Show that $r$ is ancillary with respect to $\theta$.


Exercise 2.2.4 A Truncated Normal Distribution Suppose that $y_t \overset{iid}{\sim} N(\mu, h^{-1})$
but that $y_t$ is also truncated below at $a$:

$$p(y_t \mid \mu, h, a, A) \propto \exp[-h(y_t - \mu)^2/2]\, I_{[a,\infty)}(y_t).$$

(a) Write the probability density $p(y_1, \ldots, y_T \mid \mu, h, a, A)$.

(b) Find a nontrivial vector of sufficient statistics $s$ in a model with this observables
density. (The trivial vector of sufficient statistics is the entire data set.)

(c) Are any of the sufficient statistics in (b) ancillary with respect to $(\mu, h)$?


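Exercise 2.2.5 Conditioning in the Normal Linear Regression Model Assume
the observables distribution

$$y_{1t} = \alpha y_{2t} + \beta_1 x_{1t} + \beta_2 x_{2t} + \varepsilon_{1t} \tag{2.39}$$
$$y_{2t} = \gamma_2 x_{2t} + \gamma_3 x_{3t} + \varepsilon_{2t} \tag{2.40}$$
$$\begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix} \overset{iid}{\sim} N\left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} h_1^{-1} & 0 \\ 0 & h_2^{-1} \end{pmatrix} \right]$$

for $t = 1, \ldots, T$. The covariates $x_{1t}$, $x_{2t}$, and $x_{3t}$ are all fixed. This is a simple
example of what is called a “recursive simultaneous equations system” in
econometrics. Note that $y_{2t}$ is determined in (2.40); conditional on $y_{2t}$, $y_{1t}$ is then
determined independently in (2.39).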


(a) Indicate a vector of sufficient statistics. (There is more than one right answer. 
But a shorter vector is better than a longer vector.) 

(b) Suppose that the vector of interest $\omega$ is a function of only $\alpha$ and $\beta_1$. What
are the ancillary statistics, if any? What are the nuisance parameters, if any?

(c) In the same situation as (b), can you reformulate the model and thereby
reduce the vector of sufficient statistics? If so, indicate any ancillary statistics
and nuisance parameters in this new model.

(d) Suppose that what is ultimately of interest is $P(y_{1,T+1} > c \mid X_{T+1} = x_{T+1})$,
where it is assumed that the same model will apply in period $T + 1$ and

$$x_{T+1}' = (x_{1,T+1}, x_{2,T+1}, x_{3,T+1}).$$


What are the ancillary statistics, if any? What are the nuisance parameters, 
if any? 


2.3 CONJUGATE PRIOR DISTRIBUTIONS 


The densities $p(\theta_A \mid A)$ and $p(Y_T \mid \theta_A, A)$ together represent a belief regarding the
observables $Y_T$. In selecting the distribution of unobservables, or the conditional
distribution of observables, the richer the class of functional forms from which to
choose, the more adequate the representation of prior beliefs possible. On the other
hand, the choice is constrained by the tractability of the posterior density
$p(\theta_A \mid Y_T^o, A) \propto p(\theta_A \mid A)\, p(Y_T^o \mid \theta_A, A)$, which is jointly determined by the choice of
functional forms for the data density and prior density. The search for rich tractable
classes of prior distributions may be formalized by considering classes of prior
densities, $p(\theta_A \mid \gamma_A, A)$. In this approach, $\gamma_A$ is a vector of constants that indexes
prior beliefs. In fact, we have already considered this approach in Example 2.1.2,
in which the parameters indexing prior beliefs were $\underline{\beta}$, $\underline{H}$, $\underline{s}^2$, and $\underline{\nu}$.


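Definition 2.3.1 Suppose that the observables density $p(Y_T \mid \theta_A, A)$ has the
$r \times 1$ sufficient statistic vector $s_T = s_T(Y_T; A)$, that $r$ is fixed as $T$ varies, and
$(s_T)_1 = T$. Denote the corresponding likelihood function

$$L(\theta_A; s_T, A) = L(\theta_A; s_T(Y_T^o, A), A) = L(\theta_A; Y_T^o, A) \propto p(Y_T^o \mid \theta_A, A).$$

Then the conjugate family of prior densities with respect to $p(Y_T \mid \theta_A, A)$ is
$\{p(\theta_A \mid \gamma_A, A),\ \gamma_A \in \Gamma_A\}$, where

$$p(\theta_A \mid \gamma_A, A) \propto L(\theta_A; \gamma_A, A) = k(\theta_A \mid \gamma_A, A) \tag{2.41}$$

and

$$\Gamma_A = \left\{\gamma_A : \int_{\Theta_A} k(\theta_A \mid \gamma_A, A)\, d\nu(\theta_A) < \infty\right\}.$$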


The kernel of any conjugate prior density may be interpreted as a likelihood
function corresponding to a notional data set $Z_{(\gamma_A)_1}$, with sample size $(\gamma_A)_1$ and
sufficient statistic $s_{(\gamma_A)_1}(Z_{(\gamma_A)_1})$. To the extent that we can represent prior beliefs
arising from notional data with the same probability density functional form as the
likelihood function, a conjugate prior distribution will provide a good representation
of belief. Because of (2.41), the prior density and the likelihood function have
exactly the same functional form in $\theta_A$.


Example 2.3.1 The Conjugate Prior Density in a Simplified Normal Linear
Model Suppose that in the normal linear regression model the precision is known,
$h = h_0$. From the proof of the Gauss-Markov theorem, recall

$$(y^o - X\beta)'(y^o - X\beta) = s^2 + (\beta - b)'X'X(\beta - b), \tag{2.42}$$

where $s^2 = (y^o - Xb)'(y^o - Xb)$. Hence from (2.17), we obtain

$$p(y^o \mid \beta, X, A) \propto \exp[-h_0(\beta - b)'X'X(\beta - b)/2]. \tag{2.43}$$

Thus $b$ and $X'X$ are sufficient statistics. Since (2.43) is the kernel of a normal
density in $\beta$, $\beta \mid A \sim N(\underline{\beta}, \underline{H}^{-1})$ is the conjugate prior distribution of $\beta$, with
$\gamma_A = \{\underline{\beta}, \underline{H}\}$. This prior distribution corresponds to a notional data set in which the
least-squares coefficient vector (“$b$”) is $\underline{\beta}$ and the moment matrix of the covariates
(“$X'X$”) is $h_0^{-1}\underline{H}$.

A special instance is $y_t \overset{iid}{\sim} N(\mu, 1)$. The conjugate prior distribution is
$\mu \sim N(\underline{\mu}, \underline{h}^{-1})$. The notional data set corresponds to a sample of size $\underline{h}$ with sample
mean $\underline{\mu}$.
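
The notional-data reading of this special instance can be checked numerically. The following Python sketch is an illustration only and is not part of the original text: the function name `normal_mean_posterior` and the data values are hypothetical, and the prior precision is taken to be an integer so that the notional sample can be written out explicitly.

```python
def normal_mean_posterior(prior_mean, prior_precision, data):
    """Posterior of mu when y_t ~ N(mu, 1) and mu ~ N(prior_mean, 1/prior_precision)."""
    T = len(data)
    ybar = sum(data) / T
    post_precision = prior_precision + T      # each observation adds precision 1
    post_mean = (prior_precision * prior_mean + T * ybar) / post_precision
    return post_mean, post_precision

data = [1.2, 0.7, 1.9, 1.4]                   # made-up observations

# Prior N(1.0, 1/3): behaves like 3 notional observations with sample mean 1.0.
post_mean, post_precision = normal_mean_posterior(1.0, 3.0, data)

# Pooling the notional sample with the actual data reproduces the posterior:
# the pooled sample mean is the posterior mean, the pooled size its precision.
pooled = [1.0] * 3 + data
pooled_mean = sum(pooled) / len(pooled)
```

Under these assumptions `post_mean` coincides with `pooled_mean`, and `post_precision` equals the pooled sample size, which is the sense in which the prior acts as extra data.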

More generally, the conjugate prior distribution can be expressed in terms of 
notional data in the form 


$$\underset{q \times k}{R}\,\beta \sim N(r, V). \tag{2.44}$$

Often $V$ is a diagonal matrix, and then (2.44) may be interpreted as the combination
of $q$ independent components of information about $\beta$, in the same way that the
covariates $X$ provide $T$ such independent components in the normal linear model.
From (2.44) the pdf of the random vector $z = R\beta$ is

$$p(z) = (2\pi)^{-q/2} |V|^{-1/2} \exp[-(r - R\beta)'V^{-1}(r - R\beta)/2].$$

Hence if $\mathrm{rank}(R) = k$, then (2.44) implies $\beta \mid A \sim N(\underline{\beta}, \underline{H}^{-1})$, with $\underline{H} = R'V^{-1}R$
and $\underline{\beta} = (R'V^{-1}R)^{-1}R'V^{-1}r$. This representation of prior information was
introduced by Theil and Goldberger (1961); recall the similar development for actual
data in Exercise 2.1.3.
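
As a rough illustration of this construction, the sketch below computes the implied prior precision $R'V^{-1}R$ and prior mean $(R'V^{-1}R)^{-1}R'V^{-1}r$ for $k = 2$ coefficients and a diagonal $V$. It is not taken from the text: the helper name `mixed_prior`, the restriction to $k = 2$ (so that a closed-form matrix inverse suffices), and all numbers are assumptions made for the example.

```python
def mixed_prior(R, r, v):
    """Prior mean and precision implied by R beta ~ N(r, diag(v)) when k = 2.

    R: list of q rows [R_i1, R_i2]; r: list of q values; v: list of q variances.
    """
    # Accumulate H = R' V^{-1} R and g = R' V^{-1} r one component at a time.
    H = [[0.0, 0.0], [0.0, 0.0]]
    g = [0.0, 0.0]
    for row, ri, vi in zip(R, r, v):
        for a in range(2):
            g[a] += row[a] * ri / vi
            for b in range(2):
                H[a][b] += row[a] * row[b] / vi
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    if abs(det) < 1e-12:
        raise ValueError("rank(R) < k: the implied prior is improper")
    # beta = H^{-1} g using the closed-form inverse of a 2 x 2 matrix.
    beta = [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]
    return beta, H

# Three independent pieces of notional information (hypothetical numbers):
# beta_1 near 1.0, beta_2 near 2.0, and beta_1 + beta_2 near 3.5.
R = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
r = [1.0, 2.0, 3.5]
v = [0.25, 1.0, 0.5]
beta_prior, H_prior = mixed_prior(R, r, v)
```

Each row of `R` plays the role that one observation's covariate vector plays in the regression itself, which is why the accumulation loop looks like the construction of $X'X$ and $X'y$.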


The following extension of the idea of a conjugate family of prior densities will 
prove useful in subsequent work. 


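Definition 2.3.2 In the data density $p(Y_T \mid \theta_A, A)$ let $\theta_A' = (\theta_{A1}', \theta_{A2}')$ and
fix $\theta_{A2} = \theta_{A2}^0$. Suppose that the data density $p(Y_T \mid \theta_{A1}, \theta_{A2}^0, A)$ has the $r^* \times 1$
sufficient statistic vector $s_T^* = s_T^*(Y_T; \theta_{A2}^0, A)$, that $r^*$ is fixed as $T$ varies, and
$(s_T^*)_1 = T$. Denote the corresponding partial likelihood function

$$L(\theta_{A1}; \theta_{A2}^0, s_T^*, A) = L(\theta_{A1}; \theta_{A2}^0, Y_T^o, A) \propto p(Y_T^o \mid \theta_{A1}, \theta_{A2}^0, A).$$

Then the conditionally conjugate family of prior densities with respect to
$p(Y_T \mid \theta_{A1}, \theta_{A2}, A)$ is $\{p(\theta_{A1} \mid \gamma_A^*, \theta_{A2}^0, A),\ \gamma_A^* \in \Gamma_A^*\}$, where

$$p(\theta_{A1} \mid \gamma_A^*, \theta_{A2}^0, A) \propto L(\theta_{A1}; \theta_{A2}^0, \gamma_A^*, A) = k(\theta_{A1} \mid \gamma_A^*, \theta_{A2}^0, A)$$

and

$$\Gamma_A^* = \left\{\gamma_A^* : \int_{\Theta_{A1}} k(\theta_{A1} \mid \gamma_A^*, \theta_{A2}^0, A)\, d\nu(\theta_{A1}) < \infty\right\}.$$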


Example 2.3.2 Conditionally Conjugate Prior Distributions in the Normal Linear
Model In Example 2.1.2 the prior distribution (2.11)-(2.12) for the parameter
vector $\theta_A' = (\beta', h)$ was indexed by $\gamma_A = \{\underline{\beta}, \underline{H}, \underline{s}^2, \underline{\nu}\}$. In Example 2.2.1 it was
seen that $s_T = (T, b, s^2, X'X)$ is a sufficient statistic because

$$p(y^o \mid \beta, h, X, A) \propto h^{T/2} \exp\{-h[s^2 + (\beta - b)'X'X(\beta - b)]/2\} \tag{2.45}$$

where $s^2 = (y^o - Xb)'(y^o - Xb)$. For $h = h_0$, Example 2.3.1 provides the
conditionally conjugate prior density. Conditioning on $\beta = \beta_0$ and employing (2.42),
we obtain

$$p(y^o \mid h, \beta = \beta_0, X, A) \propto h^{T/2} \exp(-\tilde{s}^2 h/2)$$

where $\tilde{s}^2 = s^2 + (\beta_0 - b)'X'X(\beta_0 - b)$. Hence the prior density (2.15) is
conditionally conjugate. It corresponds to a notional sample of $\underline{\nu} - 2$ observations of
$\varepsilon_t \overset{iid}{\sim} N(0, h^{-1})$ in which $\sum_t \varepsilon_t^2 = \underline{s}^2$.

Example 2.3.3 The Conjugate Prior Distribution in the Normal Linear Regression
Model The prior density (2.11)-(2.12) is conditionally conjugate but not
conjugate. To obtain a conjugate family of prior densities, regard (2.45) as a density
kernel in $h$ and $\beta$, and seek to determine its form. Observe that

$$\int_{\mathbb{R}^k} p(y^o \mid \beta, h, X, A)\, d\beta \propto h^{T/2} \exp(-hs^2/2) \int_{\mathbb{R}^k} \exp[-h(\beta - b)'X'X(\beta - b)/2]\, d\beta$$
$$= h^{T/2} \exp(-hs^2/2)\, (2\pi)^{k/2} |hX'X|^{-1/2}$$
$$\propto h^{(T-k)/2} \exp(-hs^2/2), \tag{2.46}$$

and that the kernel of (2.45) in $\beta$ is

$$\exp[-h(\beta - b)'X'X(\beta - b)/2]. \tag{2.47}$$

Since (2.46) is the kernel of a chi-square density in $s^2 h$ and (2.47) is the kernel of
a multivariate normal density in $\beta$, it follows that the conjugate prior distribution
can be represented

$$\underline{s}^2 h \mid A \sim \chi^2(\underline{\nu}), \tag{2.48}$$
$$\beta \mid (h, A) \sim N(\underline{\beta}, h^{-1}\underline{H}^{-1}). \tag{2.49}$$

The combination (2.48)-(2.49) is a specific instance of a normal-gamma prior, so
called because the distribution (2.49) is normal, and because any gamma distribution
for $h$ can be written in the form (2.48).
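The corresponding posterior density kernel is

$$|h\underline{H}|^{1/2}\, h^{(T + \underline{\nu} - 2)/2} \exp(-\underline{s}^2 h/2) \cdot \exp\{-h[(\beta - \underline{\beta})'\underline{H}(\beta - \underline{\beta}) + (y^o - X\beta)'(y^o - X\beta)]/2\}. \tag{2.50}$$

From (2.42) the term in brackets in this expression is

$$s^2 + (\beta - \bar{\beta})'\bar{H}(\beta - \bar{\beta}) + \underline{\beta}'\underline{H}\underline{\beta} + b'X'Xb - \bar{\beta}'\bar{H}\bar{\beta}, \tag{2.51}$$

where

$$\bar{H} = \underline{H} + X'X \quad \text{and} \quad \bar{\beta} = \bar{H}^{-1}(\underline{H}\underline{\beta} + X'Xb). \tag{2.52}$$

[Note that the definition of $\bar{H}$ in (2.52) is not the same as $\bar{H}$ in the model with a
conditionally conjugate prior distribution (2.19).] The posterior density kernel in $\beta$
alone is $\exp[-h(\beta - \bar{\beta})'\bar{H}(\beta - \bar{\beta})/2]$, whence $\beta \mid (h, y^o, X, A) \sim N[\bar{\beta}, (h\bar{H})^{-1}]$. Let

$$Q^* = s^2 + \underline{\beta}'\underline{H}\underline{\beta} + b'X'Xb - \bar{\beta}'\bar{H}\bar{\beta}, \tag{2.53}$$

and then substitute (2.53) in (2.51) and (2.51) in (2.50):

$$p(\beta, h \mid y^o, X, A) \propto h^{(T + \underline{\nu} + k - 2)/2} \exp\{-h[\underline{s}^2 + (\beta - \bar{\beta})'\bar{H}(\beta - \bar{\beta}) + Q^*]/2\}. \tag{2.54}$$

Integrating this expression with respect to $\beta$, we obtain

$$p(h \mid y^o, X, A) \propto h^{(T + \underline{\nu} + k - 2)/2}\, (2\pi)^{k/2} |h\bar{H}|^{-1/2} \exp[-h(\underline{s}^2 + Q^*)/2]$$
$$\propto h^{(T + \underline{\nu} - 2)/2} \exp[-h(\underline{s}^2 + Q^*)/2],$$

and so

$$\bar{s}^2 h \mid (y^o, X, A) \sim \chi^2(\bar{\nu}), \tag{2.55}$$

where

$$\bar{s}^2 = \underline{s}^2 + Q^* \quad \text{and} \quad \bar{\nu} = \underline{\nu} + T. \tag{2.56}$$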


To get some insight into $Q^*$, a little manipulation of (2.53) yields

$$Q^* = s^2 + (\bar{\beta} - b)'X'X(\bar{\beta} - b) + (\bar{\beta} - \underline{\beta})'\underline{H}(\bar{\beta} - \underline{\beta}). \tag{2.57}$$

The first term in this expression is the sum of squared residuals, which clearly is
informative for $h$; in fact, $T/s^2$ is the maximum likelihood estimate of $h$. The
last two terms in (2.57) measure the distance between the prior mean $\underline{\beta}$ and the
least-squares vector $b$. Note that if $\underline{\beta} = b$, then $\bar{\beta} = b$ as well and these terms
vanish. The second term in (2.57) is the distance between $b$ and $\bar{\beta}$ using a metric
proportional to data precision $hX'X$, and the third term is the distance between
$\underline{\beta}$ and $\bar{\beta}$ using a metric proportional to prior precision $h\underline{H}$. If these distances are
large, then in the context of the model the explanation is that $h$ is small. A larger
value of $(\bar{\beta} - b)'X'X(\bar{\beta} - b)$ or $(\bar{\beta} - \underline{\beta})'\underline{H}(\bar{\beta} - \underline{\beta})$ contributes to a larger value of
$Q^*$ and hence a smaller value of $h$ by means of (2.55).

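Note that we may also integrate (2.54) with respect to $h$ to obtain
$p(\beta \mid y^o, X, A)$. The kernel of (2.54) in $h$ is that of the distribution

$$[\underline{s}^2 + (\beta - \bar{\beta})'\bar{H}(\beta - \bar{\beta}) + Q^*]\, h \sim \chi^2(T + k + \underline{\nu}).$$

Referring to the constant of integration for this distribution [e.g., see (2.15)] and
integrating (2.54) with respect to $h$, we obtain

$$p(\beta \mid y^o, X, A) \propto [\underline{s}^2 + (\beta - \bar{\beta})'\bar{H}(\beta - \bar{\beta}) + Q^*]^{-(T + k + \underline{\nu})/2}.$$

The kernel of this expression is that of a multivariate Student-t distribution (Johnson
and Kotz 1972, Chapter 37; Zellner 1971, Appendix B.2) with location vector $\bar{\beta}$,
scale matrix $(T + \underline{\nu})^{-1}\bar{s}^2\bar{H}^{-1}$, and $T + \underline{\nu}$ degrees of freedom:

$$\beta \mid (y^o, X, A) \sim t(\bar{\beta},\ (T + \underline{\nu})^{-1}\bar{s}^2\bar{H}^{-1};\ T + \underline{\nu}). \tag{2.58}$$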
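
The posterior bookkeeping in this example can be traced in a few lines of code. The sketch below is an illustration under simplifying assumptions, not the author's implementation: it takes a single covariate ($k = 1$) so that every matrix is a scalar, the function name `normal_gamma_posterior` and the data are hypothetical, and `Q_253` and `Q_257` are meant to correspond to the two expressions for $Q^*$ in (2.53) and (2.57). Checking that they agree numerically is not a derivation of (2.57).

```python
def normal_gamma_posterior(x, y, beta0, H0, s2_0, nu0):
    """Posterior hyperparameters of the normal-gamma model with one covariate (k = 1)."""
    T = len(y)
    xtx = sum(xi * xi for xi in x)                          # X'X
    b = sum(xi * yi for xi, yi in zip(x, y)) / xtx          # least-squares coefficient
    ssr = sum((yi - xi * b) ** 2 for xi, yi in zip(x, y))   # sum of squared residuals s^2
    H1 = H0 + xtx                                           # posterior precision factor
    beta1 = (H0 * beta0 + xtx * b) / H1                     # precision-weighted average
    q_253 = ssr + beta0 * H0 * beta0 + b * xtx * b - beta1 * H1 * beta1
    q_257 = ssr + (beta1 - b) ** 2 * xtx + (beta1 - beta0) ** 2 * H0
    return {"b": b, "beta_bar": beta1, "H_bar": H1,
            "Q_253": q_253, "Q_257": q_257,
            "s2_bar": s2_0 + q_253, "nu_bar": nu0 + T}

x = [1.0, 2.0, 3.0, 4.0]          # made-up covariate values
y = [1.1, 1.9, 3.2, 3.9]          # made-up observations
post = normal_gamma_posterior(x, y, beta0=0.5, H0=2.0, s2_0=1.0, nu0=3)
```

In this sketch the posterior location `beta_bar` always lies between the prior mean and the least-squares coefficient, and moving `beta0` further from `b` raises both distance terms in `Q_257`, and therefore `s2_bar`.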


In Example 2.3.3 the potential for an analytically tractable posterior density 
inherent in conjugate prior densities was realized. The prior density, likelihood 
kernel, and posterior density were all members of the normal-gamma family. This 
commonality of distribution families generalizes to a much wider class of observ- 
ables distributions. 


Definition 2.3.3 The exponential family of distributions consists of all
distributions with the observables density

$$p(Y_T \mid \theta_A, A) = [g(\theta_A)]^T \prod_{t=1}^T f(y_t) \exp\left[\sum_{i=1}^r c_i(\theta_A) \sum_{t=1}^T h_i(y_t)\right],$$

where $g(\theta_A)$, $f(\cdot)$ and $\{c_i(\theta_A), h_i(\cdot)\}_{i=1}^r$ are all specified.

Examples within the exponential family include the normal, Bernoulli, Poisson,
and exponential distributions. A sufficient statistic is

$$\left[T,\ \sum_{t=1}^T h_i(y_t^o)\ (i = 1, \ldots, r)\right].$$

The conjugate family of prior densities is

$$p(\theta_A \mid \gamma_A, A) \propto [g(\theta_A)]^{\gamma_1} \exp\left[\sum_{i=1}^r c_i(\theta_A)\gamma_{i+1}\right], \tag{2.59}$$

$$\gamma_A \in \Gamma_A = \left\{\gamma_A : \int_{\Theta_A} [g(\theta_A)]^{\gamma_1} \exp\left[\sum_{i=1}^r c_i(\theta_A)\gamma_{i+1}\right] d\theta_A < \infty\right\}.$$

The posterior density of $\theta_A$ has kernel

$$[g(\theta_A)]^{\gamma_1 + T} \exp\left\{\sum_{i=1}^r c_i(\theta_A)\left[\gamma_{i+1} + \sum_{t=1}^T h_i(y_t^o)\right]\right\}. \tag{2.60}$$

Thus the conjugate prior may be interpreted as a set of $\gamma_1$ observations, with
sufficient statistics $\gamma_i$ $(i = 2, \ldots, r + 1)$. The multiplicative interaction between the
prior and the likelihood function is of precisely the same form as the multiplicative
interaction between successive observations. Consequently it is a simple matter to
combine prior information of this form as well. Specifically, given $n$ independent
experts with prior density kernels

$$[g(\theta_A)]^{\gamma_{1j}} \exp\left[\sum_{i=1}^r c_i(\theta_A)\gamma_{i+1,j}\right] \quad (j = 1, \ldots, n),$$

the joint prior density is of the form (2.59) with $\gamma_i = \sum_{j=1}^n \gamma_{ij}$ $(i = 1, \ldots, r + 1)$.
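
Because the hyperparameters simply accumulate, conjugate updating in an exponential family reduces to addition. The following sketch illustrates this for Bernoulli observables, written in the form of Definition 2.3.3 with $g(\theta) = 1 - \theta$, $c_1(\theta) = \log[\theta/(1 - \theta)]$ and $h_1(y) = y$. That choice of example, the function names, and the numbers are assumptions made here for illustration and do not come from the text.

```python
def update(gamma, data):
    """Posterior hyperparameters for Bernoulli data; gamma = (gamma_1, gamma_2)."""
    g1, g2 = gamma
    return (g1 + len(data), g2 + sum(data))    # add T and the sum of h_1(y_t) = y_t

def combine(experts):
    """Pool independent experts by adding their hyperparameters component-wise."""
    return tuple(sum(parts) for parts in zip(*experts))

def beta_parameters(gamma):
    """theta^g2 * (1 - theta)^(g1 - g2) is the kernel of a Beta(g2 + 1, g1 - g2 + 1)."""
    g1, g2 = gamma
    return (g2 + 1, g1 - g2 + 1)

# Two hypothetical experts: 4 notional trials with 1 success, 6 with 3 successes.
prior = combine([(4, 1), (6, 3)])
batch = update(prior, [1, 0, 1, 1, 0])
sequential = update(update(prior, [1, 0, 1]), [1, 0])
```

Processing the data in one batch or in two pieces gives the same hyperparameters, which mirrors the remark that the prior interacts with the likelihood exactly as successive observations interact with each other.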


Exercise 2.3.1 Some Conjugate Prior Distributions In each case find the con- 
jugate prior distribution corresponding to the observables distribution, and provide 
a “notional data” interpretation for the prior: 


(a) $y_t$ $(t = 1, \ldots, T)$ is i.i.d. uniform on $(\theta_1, \theta_2)$.
b) y $ NO, 03) @=1,...,7). 
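(c) $y_t \overset{iid}{\sim} HN(\mu, 1)$ $(t = 1, \ldots, T)$, where $HN$ is the half-normal distribution

$$p(y_t \mid \mu) = (\pi/2)^{-1/2} \exp[-(y_t - \mu)^2/2] \cdot I_{(\mu,\infty)}(y_t).$$

(d) $y_t \overset{iid}{\sim} \text{Poisson}(\theta)$; that is, $P(y_t = j \mid \theta) = \exp(-\theta)\theta^j/j!$ $(j = 0, 1, 2, \ldots)$.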

Exercise 2.3.2 A Complete Uniform Distribution Model (This is a continuation
of Exercise 2.2.2.) The observables $y_1, \ldots, y_T$ are independently distributed,
each with a uniform distribution on the interval $[0, \theta]$.


(a) Show that $p(\theta \mid A) \propto \theta^{-\gamma_1} I_{[\gamma_2, \infty)}(\theta)$ is the kernel of the family of conjugate
prior densities for $\theta$. Express the properly normalized prior density function.
What is the set $\Gamma_A$ of permissible values of $(\gamma_1, \gamma_2)$?

(b) Express the posterior density (not merely the kernel) corresponding to the 
likelihood function [derived in Exercise 2.2.2(a)] and the prior density in (a). 


(For continuation, see Exercise 2.5.1.) 


Exercise 2.3.3 Models for Positive Observables (This is a continuation of 
Exercise 2.2.1.) The observables $y_t$ are i.i.d. and strictly positive. In model $A$

$$p(y_t \mid \theta, A) = \theta \exp(-\theta y_t)\, I_{(0,\infty)}(y_t),$$

while in model $B$

$$p(y_t \mid h, B) = (\pi/2)^{-1/2} h^{1/2} \exp(-h y_t^2/2)\, I_{(0,\infty)}(y_t).$$


(a) Derive the conjugate prior densities $p(\theta \mid A)$ and $p(h \mid B)$. Make sure that
the densities are properly normalized—that is, they should integrate to one 
over the relevant range (but you do not need to demonstrate that fact). In 
each case, if the prior distribution is from a common parametric distribu- 
tion family (e.g., normal, uniform, ...), name the family and indicate the 
parameter(s). 


(b) Express the posterior density kernel for each model. If either posterior
distribution is from a common parametric distribution family, name the family
and indicate the parameter(s). If possible, indicate the posterior mean and
variance of the single unobservable in each case.


(For continuation, see Exercise 2.4.6.) 
Exercise 2.3.4 Completing the Argument Derive (2.57) from (2.53). 


Exercise 2.3.5 Uniform Distribution on the Centered Disk Suppose that the
$2 \times 1$ random vectors $y_t$ are independently and uniformly distributed on a disk
centered at $(0, 0)$ with radius $r$:

$$p(y_t \mid r, A) = \pi^{-1} r^{-2} I_{S(r)}(y_t) \quad (t = 1, \ldots, T)$$

where $S(r) = \{(y_1, y_2) : y_1^2 + y_2^2 < r^2\}$.

(a) Find a sufficient statistic in a model with this observables density.

(b) What is the maximum likelihood estimate of $r$?

(c) Find the family of conjugate prior distributions for $r$.

(d) Let $q = r^{-1}$ and suppose that the prior distribution of $q$ is $\underline{s}^2 q \sim \chi^2(\underline{\nu})$.
What is the posterior distribution of $q$ (or $r$)? Be as specific as possible.


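Exercise 2.3.6 Interval Data Suppose that $\tilde{y}_1, \ldots, \tilde{y}_T \overset{iid}{\sim} N(\mu, h^{-1})$ but none
of the $\tilde{y}_t$ are observable. Instead, we observe

$$y_t = \begin{cases} 1 & \text{if } \tilde{y}_t < c_1 \\ 2 & \text{if } c_1 \le \tilde{y}_t < c_2 \\ 3 & \text{if } \tilde{y}_t \ge c_2. \end{cases}$$

The constants $c_1$ and $c_2$ are known. (This kind of problem can arise in survey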
data when respondents are asked to provide intervals. For example, respondents 
are generally more willing to indicate that their income is within a bracket than 
they are to provide actual income.) 


(a) What is the vector of unobservables in this model? 


(b) Write the likelihood function for $\mu$ and $h$. Provide a $3 \times 1$ sufficient statistic
vector.


(c) What is the conjugate prior distribution for $\mu$ and $h$?
(Exercise 6.2.2 is related, and extends this exercise.) 


Exercise 2.3.7 Spells of Employment and the Exponential Distribution An 
economic consultant for a fast-food chain has been given a random sample of the 
chain’s service workers. In her model, A, the length of time y between the time 
the worker is hired and the time a worker quits has an exponential distribution 
with parameter $\theta^{-1}$, $p(y \mid \theta) = \theta \exp(-\theta y)$. For the purposes of this problem,
assume that no one is ever laid off or fired. The only way of leaving employment 
is by quitting. You can also assume that no one ever has more than one spell of 
employment with the company—if they quit, they never come back. 

In this problem we consider the case in which the consultant’s data consist 
entirely of “complete spells;” that is, for each individual t in the sample the 
consultant observes the length of time, $y_t$, between hiring and quitting.


(a) Express the joint density of the observables and find a sufficient statistic
vector for $\theta$.

(b) Show that the conjugate prior distribution of $\theta$ has the form $\underline{s}^2\theta \sim \chi^2(\underline{\nu})$,
and provide an “artificial data” interpretation of $(\underline{s}^2, \underline{\nu})$.

(c) Using the prior density in (b), express the kernel of the posterior density.
Show that the posterior distribution of $\theta$ has a gamma distribution of the
form $\bar{s}^2\theta \sim \chi^2(\bar{\nu})$. Express $\bar{s}^2$ and $\bar{\nu}$ in terms of the sufficient statistics from
(a) and $(\gamma_1, \gamma_2)$ from (b).

Exercise 2.3.8 Censored Spells in the Exponential Model In the same situation
as Exercise 2.3.7, suppose instead that the consultant’s data do not consist entirely 
of complete spells. Instead, the consultant collects data by gathering a random 
sample of the records of everyone hired in the first quarter of 1998. Because of the 
limitations of her budget, she can gather records only through the end of 2000. $T_1$ of
the individuals in her sample quit before the end of 2000, and for these individuals
she knows the length of the spell of employment $y_t$. $T_2$ of the individuals in her
sample were still working for the fast-food chain at the end of 2000, and for these
individuals she has only a lower bound $z_t$ on the spell of employment.


(a) Express the joint density of the observables in this situation. 

(b) What is a vector of sufficient statistics for $\theta$? (Do better than the trivial
answer $y_1, \ldots, y_{T_1}, z_1, \ldots, z_{T_2}$.)

(c) Is the prior distribution from Exercise 2.3.7(b) still conjugate? If it is, pro- 
vide an “artificial data” interpretation of the distribution. If it is not, find the 
conjugate prior distribution in this situation. 


[This exercise, and Exercise 2.3.7, are examples of Bayesian survival analysis 
for which there is a substantial literature. The simulation methods developed in 
Chapter 4 were first applied to survival analysis by Dellaportas and Smith (1993). 
For applications in economics, see DeJong (1993) and Campolieti (2000, 2001).] 


2.4 BAYESIAN DECISION THEORY AND POINT ESTIMATION 


The elements of Bayesian decision theory are isomorphic to those of behavior under 
uncertainty in economics. The connection was first developed in a series of papers 
by Friedman and Savage (1948, 1952) and Savage (1951), and classic expositions 
of this and related work are Berger (1985) and Pratt et al. (1995). Both Bayesian 
decisionmakers and economic agents associate a cardinal measure with all possible 
combinations of random elements in their environment that they cannot control, 
and those elements that they do control. The latter are called “actions” in Bayesian 
decision theory and “choices” in economics. The mapping to a cardinal measure 
is a loss function in the former and a utility function in the latter, but except for 
a change in sign they serve the same purpose. The decisionmaker takes the Bayes 
action that minimizes the expected value of his loss function; the economic agent 
makes the choice that maximizes the expected value of her utility function. The 
formal setup for the Bayesian decisionmaker is as follows. 


Definition 2.4.1 The elements of a Bayesian decision problem are an action
$a \in A \subseteq \mathbb{R}^m$ controlled by the decisionmaker, a loss function $L(a, \omega)$ depending
on the action and a vector of interest $\omega \in \Omega \subseteq \mathbb{R}^q$, and a distribution function $P$
for $\omega$. The objective of the decisionmaker is to minimize the Bayes risk function

$$R(a) = E[L(a, \omega)] = \int_{\Omega} L(a, \omega)\, p(\omega)\, d\nu(\omega).$$

The decision problem originates in the need to choose $a$. The loss function is the
criterion for choosing $a$. It specifies the vector of interest $\omega$, which in turn suggests
the data and models the decisionmaker may wish to use. These data and models
dictate the form of $P$, which need not be specified at this level of generality.


Definition 2.4.2 If the Bayes risk function $R(a)$ exists for all $a \in A$, then any
solution $\hat{a} = \arg\min_{a \in A} R(a)$ of the Bayesian decision problem is a Bayes action.
The associated Bayes risk is $R(\hat{a})$.


In application the relevant conditioning set is the information available at the 
time the action a is taken. This might be when no data are available (as when an 
experiment is designed), when some data are available (as when deciding which 
experiment, if any, to conduct next), or when data are being observed regularly (as 
in forecasting situations). The term Bayes risk is sometimes confined to the case of 
a prior distribution [see, e.g., Berger (1985), Section 1.3]. The broader definition 
is adopted here because of its utility in applied Bayesian decisionmaking. 

The random vector $\omega$ in Definitions 2.4.1 and 2.4.2 is the vector of interest
identified at the end of Section 1.2. In the class size example introduced in
Section 1.1.1, the vector of interest $\omega$ is the average test score in the school
district. If $T$ is the number of teachers in the school district, $S$ is the number of
students, $c$ is the cost of each teacher, and the school district places the value
$d$ on each test point for each student each year, then the loss function might be
$L(T, \omega) = cT - dS\omega$. The relationship between class size and test scores creates a
link between $T$ and $\omega$. Uncertainty about the link is reflected in the distribution of
$\omega$ conditional on $T$. Such problems can be solved routinely using the simulation
methods set forth in Chapter 4, and we return to this problem with such methods 
in hand in Example 5.1.2. 

Nevertheless, simple loss functions teach a great deal about the structure of 
Bayesian decision problems. In particular, three solutions are often applied by 
investigators who do not expressly articulate the loss function that supports the solu- 
tion. The critical client is then well served by examining whether her loss function 
is well approximated by the one that the investigator has assumed implicitly. 


Definition 2.4.3 The loss function $L(a, \omega)$ is a quadratic loss function if

$$L(a, \omega) = (a - \omega)'Q(a - \omega), \tag{2.61}$$

where $Q$ is a positive definite matrix.


This definition is more general than it might first seem; any second-order
polynomial in $a$ and $\omega$ can be brought into the form $(a - \omega)'Q(a - \omega)$ plus a random
term that is unaffected by $a$. If $Q$ is positive definite, then the loss function is
quadratic. Note the symmetry in the loss function of Definition 2.4.3; for a given
action $a$ the realized loss is the same if the outcome is $\omega = a + \delta$ or $\omega = a - \delta$.
An actual application may or may not be well served by this characteristic, and 
this point should be examined before proceeding with a quadratic loss function, 
which leads to the following simple implication. 
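
Theorem 2.4.1 Bayes Action with a Quadratic Loss Function If the loss
function is (2.61) and $A = \mathbb{R}^q$, then the Bayes action is $\hat{a} = E(\omega)$ and the Bayes
risk is $R(\hat{a}) = \mathrm{tr}[Q\, \mathrm{var}(\omega)]$.


Proof: Note that

$$R(a) = E[L(a, \omega)] = \int_{\Omega} (a - \omega)'Q(a - \omega)\, p(\omega)\, d\nu(\omega)$$

is twice differentiable:

$$\partial R(a)/\partial a = 2\int_{\Omega} Q(a - \omega)\, p(\omega)\, d\nu(\omega),$$
$$\partial^2 R(a)/\partial a\, \partial a' = 2Q.$$

These conditions imply that $\hat{a} = E(\omega)$ is the unique Bayes action. The Bayes risk is

$$E(\hat{a} - \omega)'Q(\hat{a} - \omega) = \mathrm{tr}\, E[Q(\hat{a} - \omega)(\hat{a} - \omega)']$$
$$= \mathrm{tr}\, E\{Q[\omega - E(\omega)][\omega - E(\omega)]'\} = \mathrm{tr}[Q\, \mathrm{var}(\omega)]. \qquad \square$$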


This result is strong and perhaps surprising in two dimensions: (1) the action
is the same no matter what the positive definite matrix $Q$ is, so that a change in $Q$ will
affect Bayes risk but will leave the Bayes action unchanged; and (2) it is only the
mean of $\omega$ that matters for the Bayes action. Other properties of the distribution
are irrelevant beyond the fact that variance must exist for the problem to be well
defined. Second moments matter for Bayes risk, but not for the choice of $a$.
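
Both points can be illustrated by simulation. In the sketch below, which is an illustration rather than anything from the original text, the distribution of the two-dimensional vector of interest, the two weighting matrices `Q_a` and `Q_b`, and all names are hypothetical. The empirical risk is evaluated at the sample mean and compared with the trace expression computed from sample second moments.

```python
import random

def quadratic_risk(a, draws, Q):
    """Monte Carlo estimate of E[(a - w)' Q (a - w)] for a symmetric 2 x 2 matrix Q."""
    total = 0.0
    for w in draws:
        d0, d1 = a[0] - w[0], a[1] - w[1]
        total += Q[0][0] * d0 * d0 + 2.0 * Q[0][1] * d0 * d1 + Q[1][1] * d1 * d1
    return total / len(draws)

rng = random.Random(0)
draws = []
for _ in range(5000):
    w0 = rng.expovariate(1.0)                       # deliberately skewed component
    draws.append((w0, w0 + rng.gauss(0.0, 1.0)))    # second component correlated with it

n = len(draws)
mean = (sum(w[0] for w in draws) / n, sum(w[1] for w in draws) / n)
# Sample second moments about the mean (divisor n).
v00 = sum((w[0] - mean[0]) ** 2 for w in draws) / n
v11 = sum((w[1] - mean[1]) ** 2 for w in draws) / n
v01 = sum((w[0] - mean[0]) * (w[1] - mean[1]) for w in draws) / n

Q_a = [[1.0, 0.0], [0.0, 1.0]]
Q_b = [[5.0, -1.0], [-1.0, 0.5]]                    # positive definite: 5 * 0.5 - 1 > 0
matrices = (("a", Q_a), ("b", Q_b))
risk_at_mean = {name: quadratic_risk(mean, draws, Q) for name, Q in matrices}
trace_formula = {name: Q[0][0] * v00 + 2.0 * Q[0][1] * v01 + Q[1][1] * v11
                 for name, Q in matrices}
```

The two weighting matrices give different risks but the same minimizing action, and moving the action away from the sample mean in any direction raises the empirical risk under either matrix.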

Note that if $\omega$ were replaced by its mean, which we can denote $\bar{\omega}$, in the quadratic
loss function, then $L(a, \bar{\omega}) = (a - \bar{\omega})'Q(a - \bar{\omega})$. The loss-minimizing action is
$\hat{a} = \bar{\omega}$, which is also the solution of the actual Bayesian decision problem. This fact
is sometimes referred to as the certainty equivalence principle, after Simon (1956), 
who introduced it in a more general context. A corollary is that if only means are 
reported by an investigator, then the application of the investigator’s findings is 
effectively restricted to decisions well characterized by quadratic loss. This may or 
may not be a reasonable limitation. It depends on the application at hand. 

Because of their symmetry, quadratic loss functions are especially inappropriate
if the consequences of an action being “too high” are quite different in severity from
those that arise if an action is similarly “too low.” This asymmetry is characteristic
of many situations that are described as “risky” in the colloquial use of that word. 
The following loss function explicitly incorporates asymmetry, for the case of an 
action with a single dimension. 


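Definition 2.4.4 If $a \in A \subseteq \mathbb{R}$ and $\omega \in \Omega \subseteq A$, the loss function $L(a, \omega)$ is a
linear-linear loss function if

$$L(a, \omega) = (1 - q)(a - \omega)\, I_{(-\infty, a)}(\omega) + q(\omega - a)\, I_{(a, \infty)}(\omega) \tag{2.62}$$

where $q \in (0, 1)$.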

To take a familiar example, consider an individual with modest savings who
is about to retire. Let $a$ represent the individual’s postretirement economic
standard of living and $\omega$ the return on her savings. If $\omega > a$, she will find her wealth
accumulating, whereas if $\omega < a$, she will eventually become destitute. In an
application of a linear-linear (or “lin-lin”) loss function, it would be the case that
$q < \frac{1}{2}$.


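Theorem 2.4.2 Bayes Action with a Linear-Linear Loss Function If the
loss function $L(a, \omega)$ is (2.62), the random variable $\omega$ is absolutely continuous and
$A = \mathbb{R}$, then the Bayes action is

$$\hat{a} = \{a : P(\omega \le a) = q\} \tag{2.63}$$

and the Bayes risk is

$$R(\hat{a}) = q(1 - q)\,[E(\hat{a} - \omega \mid \omega \le \hat{a}) + E(\omega - \hat{a} \mid \omega > \hat{a})].$$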


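Proof: To verify the solution note that

$$R(a) = E[L(a, \omega)] = (1 - q)\int_{-\infty}^{a} (a - \omega)\, p(\omega)\, d\omega + q\int_{a}^{\infty} (\omega - a)\, p(\omega)\, d\omega.$$

This function is twice differentiable:

$$dR(a)/da = (1 - q)P(\omega \le a) - qP(\omega > a),$$
$$d^2R(a)/da^2 = p(a).$$

The first-order condition implies

$$(1 - q)P(\omega \le \hat{a}) = q[1 - P(\omega \le \hat{a})] \iff P(\omega \le \hat{a}) = q. \tag{2.64}$$

If $p(\hat{a}) > 0$ for $\hat{a}$ satisfying (2.64), there is a unique Bayes action, and if not, then
$\hat{a}$ is set valued. Substituting $a = \hat{a}$ in (2.62) and taking the expectation, we obtain

$$R(\hat{a}) = (1 - q)E[(\hat{a} - \omega) \mid \omega \le \hat{a}]\, P(\omega \le \hat{a}) + qE[(\omega - \hat{a}) \mid \omega > \hat{a}]\, P(\omega > \hat{a})$$
$$= q(1 - q)\{E[(\hat{a} - \omega) \mid \omega \le \hat{a}] + E[(\omega - \hat{a}) \mid \omega > \hat{a}]\}. \qquad \square$$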


The Bayes action $\hat{a}$ is the $q$th quantile of the distribution of $\omega$. If $q$ is small
(large), then the loss if $\omega$ is larger (smaller) than $a$ is relatively small (large), and
so the Bayes action is small (large) relative to typical realizations of $\omega$. Note the
implication of Theorem 2.4.2 for our retiree: because $q < \frac{1}{2}$, she will choose a
standard of living, $a$, less than the median of $\omega$ (which would be the Bayes action
if $q = \frac{1}{2}$). The Bayes action for the linear-linear function provides some insight
into the structure of the value at risk problem introduced in Section 1.1.2. In this
problem $\omega$ is the value of the portfolio in the future period $t^*$. For any quadratic
loss function $L(a, \omega) = (a - \omega)^2 q$, the solution is $\hat{a} = E(\omega)$. Such a loss function
is inappropriate if the decisionmaker is concerned about the risk undertaken by
institutions with fiduciary responsibilities to preserve capital. A loss function of
the form (2.62) would be more appropriate, and if we take $q = .05$, then the Bayes
action is the 5% quantile of the distribution of $\omega$.
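
The quantile result can be checked by brute force on simulated draws. The sketch below is only an illustration: the lognormal distribution standing in for the future portfolio value, the sample size, and the function name `lin_lin_risk` are assumptions. It evaluates the empirical counterpart of the linear-linear risk at every draw and compares the minimizer with the empirical 5% quantile.

```python
import math
import random

def lin_lin_risk(a, draws, q):
    """Empirical expected linear-linear loss of action a."""
    total = 0.0
    for w in draws:
        total += (1.0 - q) * (a - w) if w < a else q * (w - a)
    return total / len(draws)

rng = random.Random(1)
draws = sorted(rng.lognormvariate(0.0, 0.5) for _ in range(1001))
q = 0.05

# Empirical q-th quantile: the ceil(q * N)-th order statistic.
quantile = draws[math.ceil(q * len(draws)) - 1]

# Brute-force search. The empirical risk is piecewise linear in a, so a
# minimizer can always be found among the draws themselves.
best = min(draws, key=lambda a: lin_lin_risk(a, draws, q))
mean = sum(draws) / len(draws)
```

With `q = 0.05` the minimizing action sits far below the sample mean, which is the action a quadratic loss function would select from the same draws.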

It is common to see reports of “most probable” or “most likely” values of 
random variables. One of the attractions of these solutions is that they are easy to 
compute. Unlike the quadratic and linear—linear loss functions, no integration of 
probability densities is required. The formal rationale for this action rests on some 
rather strong requirements. 


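Definition 2.4.5 If the distribution of $\omega$ is absolutely continuous, the loss
function $L(a, \omega; \varepsilon)$ is a zero-one loss function if $a \in \Omega$ and

$$L(a, \omega; \varepsilon) = 1 - I_{N_\varepsilon(a)}(\omega),$$

where $N_\varepsilon(a)$ is an open $\varepsilon$-neighborhood of $a$.


Theorem 2.4.3 Bayes Action with a Zero-One Loss Function Suppose that
$p(\omega)$ is a continuous function with a unique mode at $\omega = \hat{a}$, and $A = \mathbb{R}^q$. Let $a(\varepsilon)$
be the Bayes action for a zero-one loss function. Then $\lim_{\varepsilon \to 0} a(\varepsilon) = \hat{a}$.


Proof: Given any $\delta > 0$, let $\Omega_\delta = \{\omega : p(\omega) > p(\hat{a}) - \delta\}$. For $\delta$ sufficiently
small, $\Omega_\delta$ is an open neighborhood of the mode $\hat{a}$. There exists $\varepsilon^* > 0$ such that if
$\varepsilon < \varepsilon^*$, then $N_\varepsilon(\hat{a}) \subseteq \Omega_\delta$. Hence $\forall \varepsilon < \varepsilon^*$, $N_\varepsilon(a(\varepsilon)) \cap \Omega_\delta \ne \emptyset$. (Why?) Since $\delta$
can be arbitrarily small, $\hat{a}$ must be a limit point of $N_\varepsilon(a(\varepsilon))$. $\square$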


Of course, if the mean and the mode of $\omega$ are the same, then quadratic loss
and zero-one loss lead to the same Bayes action. This action will also result if in
addition the distribution of $\omega$ is symmetric about its mean, and linear-linear loss
applies element by element with $q = \frac{1}{2}$. It is rare to find modes reported without
at least an implicit appeal to unimodality, symmetry, or both. The solution of the
zero-one loss Bayesian decision problem is appealing because of its computational
simplicity rather than its approximation of common actual loss functions.
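
The limiting argument behind the zero-one result can be seen in a small numerical experiment. The sketch below assumes, purely for illustration, that the variable of interest has a gamma distribution with shape 3 and scale 1, whose density is proportional to $\omega^2 e^{-\omega}$ and whose mode is 2. The function names and the grid search are hypothetical choices, not part of the text.

```python
import math

def gamma3_cdf(x):
    """CDF of the gamma distribution with shape 3 and scale 1 (mode at 2)."""
    if x <= 0.0:
        return 0.0
    return 1.0 - math.exp(-x) * (1.0 + x + 0.5 * x * x)

def zero_one_action(eps, step=0.001, upper=10.0):
    """Grid search for the a maximizing P(|w - a| < eps), i.e. minimizing zero-one risk."""
    best_a, best_prob = 0.0, -1.0
    for i in range(int(upper / step) + 1):
        a = i * step
        prob = gamma3_cdf(a + eps) - gamma3_cdf(a - eps)
        if prob > best_prob:
            best_a, best_prob = a, prob
    return best_a

# Bayes actions for a shrinking neighborhood radius.
actions = {eps: zero_one_action(eps) for eps in (1.0, 0.5, 0.1, 0.01)}
```

For a wide neighborhood the action is pulled toward the long right tail of this skewed density, and it approaches the mode as the radius shrinks. The mean of the same distribution is 3, so the three loss functions discussed here give visibly different actions when the density is asymmetric.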

As a practical matter, attention need not be confined to these or other loss func- 
tions because they lead to analytically simple solutions. Example 5.1.2 illustrates 
the use of a realistic but analytically intractable loss function, together with a pos- 
terior simulator, to find the Bayes action. For an application in marketing, see Rossi 
et al. (1996). 

Bayesian point estimation is a Bayes action corresponding to a loss function in
which the vector of interest $\omega$ is the vector of unobservables $\theta_A$. In the usual setup
in which the relevant distribution is the posterior, the Bayes action is the point
estimate

$$\hat{\theta}_A = \arg\min_{\hat{\theta}_A} \int_{\Theta_A} L(\hat{\theta}_A, \theta_A)\, p(\theta_A \mid y^o, A)\, d\nu(\theta_A).$$

Bayesian point estimation differs fundamentally from non-Bayesian approaches
to estimation that seek a rule, or estimator $\hat{\theta}_A(y)$, to minimize

$$\int L[\hat{\theta}_A(y), \theta_A]\, p(y \mid \theta_A, A)\, d\nu(y).$$

In the solution of this problem $\hat{\theta}_A$ nearly always depends on $\theta_A$ as well as on
$y$, and consequently $\hat{\theta}_A$ is almost never feasible. There is no single principle for
eliminating the dependence of $\hat{\theta}_A$ on $\theta_A$, and since this can usually be done in a
number of reasonable ways there is often a proliferation of non-Bayesian estimates
in any particular application. By contrast, given a complete model and a loss
function, the Bayesian point estimate is well defined as long as $E[L(\hat{\theta}_A, \theta_A) \mid y^o, A]$
is well defined and finite for at least some $\hat{\theta}_A$.

The solution of Bayesian decision problems in the three specific cases just
considered carries through directly to the specific case of point estimation of the
vector of unobservables $\theta_A$. A quadratic loss function $L(\hat{\theta}_A, \theta_A)$ leads to the
posterior mean $\hat{\theta}_A = E(\theta_A \mid y^o, A)$. The linear-linear loss function for a parameter
selects the posterior $q$th quantile. The limiting zero-one loss function leads to the
mode of $p(\theta_A \mid y^o, A)$ when applied to the entire vector $\theta_A$.

Point estimation of parameters is heavily emphasized in statistics — perhaps more 
so in the non-Bayesian than the Bayesian literature, but the latter also positions point 
estimation prominently. This nearly always reflects the evolution of technology 
rather than the underlying decision problem. The parameter vector $\theta_A$, in the context
of a complete model, provides expression of

$$p(\omega \mid A) = \int_{\Theta_A} p(\omega \mid \theta_A, A)\, p(\theta_A \mid A)\, d\nu(\theta_A)$$

in a way that facilitates Bayesian updating with data:

$$p(\omega \mid y^o, A) = \int_{\Theta_A} p(\omega \mid \theta_A, A)\, p(\theta_A \mid y^o, A)\, d\nu(\theta_A). \tag{2.65}$$

Even technically oriented decisionmakers have little use for parameters, which are
intermediate devices of interest to investigators. Occasionally some components of
$\omega$ correspond to certain components of $\theta_A$, but this emphasizes that it is the vector
of interest $\omega$ and not the parameter vector $\theta_A$ that matters in the decision. It is
never the case that

$$p(\omega \mid y^o, A) = p(\omega \mid \hat{\theta}_A, A), \tag{2.66}$$

an assumption commonly made in non-Bayesian statistics. There may, of course,
be cases in which the difference between (2.65) and (2.66) is negligible, but this
again reinforces the point that the governing rule is (2.65) rather than (2.66). The
burden is to demonstrate the closeness of (2.66) to (2.65) if we are to use (2.66).
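
The gap between the two expressions is easy to see in a toy model. The sketch below is an illustration under assumptions that are not in the text: observables are $N(\mu, 1)$, the posterior for $\mu$ is normal with the stated mean and precision, the vector of interest is the next observation, and (2.66) is read as conditioning on a point estimate of the parameter. All names and numbers are hypothetical.

```python
import random

def predictive_draws(post_mean, post_precision, n, rng):
    """Draw w = y_{T+1} by composition: mu from its posterior, then w | mu ~ N(mu, 1)."""
    sd_mu = post_precision ** -0.5
    return [rng.gauss(rng.gauss(post_mean, sd_mu), 1.0) for _ in range(n)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(2)
post_mean, post_precision = 1.0, 4.0       # a short sample: mu is not well determined
draws = predictive_draws(post_mean, post_precision, 100_000, rng)

var_integrated = variance(draws)           # Monte Carlo counterpart of (2.65)
var_exact = 1.0 + 1.0 / post_precision     # analytic predictive variance in this model
var_plug_in = 1.0                          # variance if mu is fixed at a point estimate
```

Here the integrated predictive distribution is about 25% more dispersed than the plug-in one. The gap shrinks as `post_precision` grows, which is the situation in which the two densities may be close enough for the difference to be negligible.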

To illustrate the techniques involved in analytic approaches to Bayesian point
estimation, and provide some more insight into the normal linear regression model,
consider the tasks of estimating $\beta$ and $h$, respectively.


Example 2.4.1 Estimation of $\beta$ in the Normal Linear Regression Model In
the normal linear regression model (2.10)-(2.12)

\[ \beta \mid (h, y^o, X, A) \sim N(\bar{\beta}, \bar{H}^{-1}). \]

(Throughout this example we condition on $h$ as well as the data.) Given any
quadratic loss function, $\hat{\beta} = E[\beta \mid h, y^o, X, A] = \bar{\beta}$. Given a zero-one loss
function, $\hat{\beta} = \bar{\beta}$ as well. Given the linear-linear loss function

\[ L(\hat{\beta}_j, \beta_j) = (1 - q)(\hat{\beta}_j - \beta_j) I_{(-\infty, \hat{\beta}_j)}(\beta_j) + q(\beta_j - \hat{\beta}_j) I_{(\hat{\beta}_j, \infty)}(\beta_j), \]

the Bayes estimate is the posterior $q$th quantile,

\[ \hat{\beta}_j = \bar{\beta}_j + (\bar{h}^{jj})^{1/2}\, \Phi^{-1}(q), \]

where $\bar{h}^{jj}$ denotes the $(j, j)$th entry of $\bar{H}^{-1}$ and $\Phi^{-1}$ is the inverse cdf of the
standard normal distribution. If $q = 1/2$, then $\hat{\beta}_j = \bar{\beta}_j$. Clearly the equivalence of the
three estimates is driven by the unimodality and symmetry of the posterior density
of $\beta$, and will emerge whenever the posterior pdf of the parameter estimated is
unimodal and symmetric.
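
The quantile formula in Example 2.4.1 can be evaluated directly. The following is a minimal sketch, not part of the original text: the function name and the numerical values of $\bar{\beta}_j$ and $\bar{h}^{jj}$ are illustrative assumptions, and the inverse normal cdf comes from Python's standard library.

```python
from statistics import NormalDist

def linlin_estimate(beta_bar_j, h_jj, q):
    """Posterior q-th quantile of beta_j when beta_j | (h, y, X) ~ N(beta_bar_j, h_jj).

    h_jj stands for the (j, j) entry of the inverse posterior precision matrix,
    i.e. the conditional posterior variance of beta_j (hypothetical input).
    """
    return beta_bar_j + (h_jj ** 0.5) * NormalDist().inv_cdf(q)

# Assumed posterior mean and variance, for illustration only.
beta_bar_j, h_jj = 1.5, 0.04
for q in (0.25, 0.5, 0.75):
    print(q, round(linlin_estimate(beta_bar_j, h_jj, q), 4))
```

With $q = 1/2$ the estimate coincides with the posterior mean, and the estimates for $q$ and $1 - q$ are symmetric about it, as the symmetry argument in the example implies.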


Example 2.4.2 Estimation of $h$, $\sigma^2 = h^{-1}$, and $\sigma = h^{-1/2}$ in the Normal Linear
Regression Model In the normal linear regression model (2.10)-(2.12)

\[ \bar{s}^2 h \mid (\beta, y^o, X, A) \sim \chi^2(\bar{\nu}), \]

where $\bar{s}^2$ and $\bar{\nu}$ are as defined in (2.26). Since the mean of a chi-square random
variable is its degrees of freedom, the Bayes estimate of $h$ conditional on
$(\beta, y^o, X, A)$ is $\hat{h} = \bar{\nu}/\bar{s}^2$ if the loss function is
quadratic. From (2.14) the mode of the chi-square pdf is its degrees of freedom
less two (or else zero, whichever is larger), and so the zero-one loss estimate is
$\hat{h} = (\bar{\nu} - 2)/\bar{s}^2$ if $\bar{\nu} > 2$.

Through the same change of variable to $\sigma^2 = h^{-1}$ undertaken in the prior leading
to the pdf (2.16), we have

\[ p(\sigma^2 \mid \beta, y^o, X, A) = [2^{\bar{\nu}/2}\,\Gamma(\bar{\nu}/2)]^{-1} (\bar{s}^2)^{\bar{\nu}/2} (\sigma^2)^{-(\bar{\nu}+2)/2} \exp(-\bar{s}^2/2\sigma^2). \tag{2.67} \]

Then $E[\sigma^2 \mid \beta, y^o, X, A] = \bar{s}^2/(\bar{\nu} - 2)$ if $\bar{\nu} > 2$, but estimation of $\sigma^2$ under quadratic
loss has no solution if $\bar{\nu} \le 2$. The mode of $p(\sigma^2 \mid \beta, y^o, X, A)$ occurs at
$\bar{s}^2/(\bar{\nu} + 2)$, which is therefore the zero-one loss estimate of $\sigma^2$ in this model.
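If we transform instead to the standard deviation $\sigma = h^{-1/2}$, by the usual change
of variable

\[ p(\sigma \mid \beta, y^o, X, A) = 2\,[2^{\bar{\nu}/2}\,\Gamma(\bar{\nu}/2)]^{-1} (\bar{s}^2)^{\bar{\nu}/2}\, \sigma^{-(\bar{\nu}+1)} \exp(-\bar{s}^2/2\sigma^2). \tag{2.68} \]

The corresponding posterior mean and Bayes estimate under a quadratic loss
function are

\[ \hat{\sigma} = E(\sigma \mid \beta, y^o, X, A) = (\bar{s}^2/2)^{1/2}\, \frac{\Gamma[(\bar{\nu} - 1)/2]}{\Gamma(\bar{\nu}/2)} \quad (\bar{\nu} > 1). \tag{2.69} \]

With a zero-one loss function the Bayes estimate of $\sigma$ is the posterior mode
$[\bar{s}^2/(\bar{\nu} + 1)]^{1/2}$.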

In all three cases there is no simple closed-form expression for the median of the
posterior distribution, which is in turn the estimate under a symmetric linear-linear
loss function. But it is easy to compute the estimates, given $\bar{s}^2$ and $\bar{\nu}$, using standard
software for the inverse of the cdf of a chi-square random variable. Here are some
values in the case of $h$:

    $\bar{s}^2$    $\bar{\nu}$    Mean     Mode    Median
        5            5        1.000    0.60    .8702
       20           20        1.000    0.90    .9669
      100          100        1.000    0.98    .9934

Note that for a fixed value of $\bar{s}^2/\bar{\nu}$, the estimates converge as $\bar{\nu}$ increases. This
is due to the fact that the distribution of a normalized chi-square random variable
approaches the standard normal distribution as $\bar{\nu}$ increases, and the normal pdf is
symmetric about its mean. Also note that the mode is always below the median
and the mean is always above, and that the distance between the mode and the
median is about twice that between the mean and the median.
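
The table of estimates of $h$ can be reproduced approximately with a few lines of code. The sketch below is an illustration rather than the software the text has in mind: it avoids external libraries by computing the chi-square cdf from the power series for the lower incomplete gamma function and inverting it by bisection, and all function names are hypothetical.

```python
import math

def chi2_cdf(x, nu):
    """chi-square(nu) cdf from the power series of the lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, z = nu / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-17 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_quantile(q, nu):
    """Invert the cdf by bisection (adequate for a small table)."""
    lo, hi = 0.0, nu + 20.0 * math.sqrt(2.0 * nu) + 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, nu) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def estimates_of_h(s2_bar, nu_bar):
    """Posterior mean, mode, and median of h when s2_bar * h ~ chi-square(nu_bar)."""
    mean = nu_bar / s2_bar
    mode = max(nu_bar - 2.0, 0.0) / s2_bar
    median = chi2_quantile(0.5, nu_bar) / s2_bar
    return mean, mode, median

for s2_bar, nu_bar in [(5, 5), (20, 20), (100, 100)]:
    mean, mode, median = estimates_of_h(s2_bar, nu_bar)
    print(f"{s2_bar:4d} {nu_bar:4d} {mean:.3f} {mode:.2f} {median:.4f}")
```

The same quantile function, evaluated at other values of $q$, gives the estimate under an asymmetric linear-linear loss function.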


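Exercise 2.4.1 Generalizing the Quadratic Loss Function Consider the
weighted squared-error loss function

\[ L(\hat{\omega}, \omega) = w(\omega)\,(\hat{\omega} - \omega)' Q (\hat{\omega} - \omega) \]

where $Q$ is a positive definite matrix and $w(\omega) > 0 \;\forall\, \omega \in \Omega$. Show that the
corresponding Bayes estimate of $\omega$ is

\[ \hat{\omega} = \frac{E[\omega\, w(\omega) \mid y^o, A]}{E[w(\omega) \mid y^o, A]}. \]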
Exercise 2.4.2 Generalizing the Linear-Linear Loss Function In this exercise
the distribution of $\omega$ need not be absolutely continuous. This implies that $\hat{a}$ defined
in (2.63) need not exist. However, if (2.63) is replaced by

\[ \hat{a} = \{a : P(\omega \le a) \ge q,\; P(\omega \ge a) \ge 1 - q\} \tag{2.70} \]

then $\hat{a}$ will always exist, although it still need not be unique.

(a) Under what conditions will the loss functions

\[ L(a, \omega) = (1 - q_1)(a - \omega) I_{(-\infty, a)}(\omega) + q_1(\omega - a) I_{(a, \infty)}(\omega) \]

and

\[ L(a, \omega) = (1 - q_2)(a - \omega) I_{(-\infty, a)}(\omega) + q_2(\omega - a) I_{(a, \infty)}(\omega), \]

where $q_1 \in (0, 1)$ and $q_2 \in (0, 1)$ but $q_1 \ne q_2$, lead to the same Bayes
action $\hat{a}$?

(b) The risk function can be defined with respect to the probability measure $P$
of the random variable $\omega$ as

\[ R(a) = (1 - q) \int_{-\infty}^{a} (a - \omega)\, dP(\omega) + q \int_{a}^{\infty} (\omega - a)\, dP(\omega). \]

Show that the right derivative of $R(\cdot)$ is

\[ dR/da^{+} = (1 - q) P(\omega \le a) - q P(\omega > a) \]

and the left derivative is

\[ dR/da^{-} = (1 - q) P(\omega < a) - q P(\omega \ge a). \]

(c) Show that if $P(\omega \le a) < q$, then $dR/da^{+} < 0$, and if $P(\omega \ge a) < 1 - q$,
then $dR/da^{-} > 0$. Conclude that $\hat{a}$ defined in (2.70) is the set of Bayes
actions.

(d) Under what conditions is $\hat{a}$ unique?


Exercise 2.4.3 Linex Loss Function Zellner (1986a) proposed the linear-
exponential (or "linex") loss function

\[ L(a, \omega) = \exp[r(a - \omega)] - r(a - \omega) - 1, \]

where $r \ne 0$.

(a) Show that the Bayes action is $\hat{a} = -r^{-1} \log\{E[\exp(-r\omega)]\}$.
(b) Show that if, in addition, $\omega \sim N(\mu, h^{-1})$, then $\hat{a} = \mu - (r/2h)$.
Exercise 2.4.4 Properties of Estimates Let $\theta_A' = (\theta_{A1}', \theta_{A2}')$.

(a) Consider two cases. In both cases, the loss function is quadratic. In the first
case it is of the form $L(\hat{\theta}_A, \theta_A)$, but in the second case it is of the form
$L(\hat{\theta}_{A1}, \theta_{A1})$, but is still quadratic. Is the Bayes estimate of $\theta_{A1}$ the same in
the two cases?

(b) Consider two cases. In both cases, the loss function is zero-one. In the first
case it is of the form $L(\hat{\theta}_A, \theta_A)$, but in the second case it is of the form
$L(\hat{\theta}_{A1}, \theta_{A1})$, but is still zero-one. Is the Bayes estimate of $\theta_{A1}$ the same in
the two cases?


Exercise 2.4.5 Point Estimation for the Lognormal Distribution Suppose
$\log(y_t) \mid (\mu, h, A) \overset{iid}{\sim} N(\mu, h^{-1})$ $(t = 1, \ldots, T)$.

(a) Beginning with the pdf of the univariate normal distribution, derive and
express the pdf of $y_t$ in terms of $\mu$ and $h$. Then derive $E(y_t \mid \mu, h, A)$. In
doing this you may find it convenient to use the moment generating function
for $z \sim N(\mu, h^{-1})$, which is $E[\exp(tz)] = \exp(t\mu + t^2/2h)$.

(b) Here, and for the rest of this exercise, suppose that $h$ is known but $\mu$ is not,
and $\mu \mid A \sim N(\underline{\mu}, \underline{h}^{-1})$. Derive the posterior distribution for $\mu$.

(c) Find the Bayes estimate $\hat{\omega}$ of $\omega = E(y_t \mid \mu, h, A)$, given a quadratic loss
function.

(d) Find the Bayes estimate $\hat{\omega}$ of $\omega = E(y_t \mid \mu, h, A)$, given the loss function
$L(\hat{\omega}, \omega) = |\omega - \hat{\omega}|$.

(e) Find the Bayes estimate $\hat{\omega}$ of $\omega = E(y_t \mid \mu, h, A)$, given a zero-one loss
function.


Exercise 2.4.6 Models for Positive Observables Recall the exponential
model in Exercise 2.3.3. The observables $y_1, \ldots, y_T$ are i.i.d., $p(y_t \mid \theta, A) =
\theta \exp(-\theta y_t) I_{(0, \infty)}(y_t)$. Suppose $\omega = y_{T+1}$, which is (as yet) unobserved,
independent of the observed $y_1 = y_1^o, \ldots, y_T = y_T^o$, and has the same distribution as
each of $y_1, \ldots, y_T$. Find the estimate $\hat{\omega}$ of $\omega$ implied by the loss function

\[ (\hat{\omega} - \omega) I_{(-\infty, \hat{\omega})}(\omega) + 3(\omega - \hat{\omega}) I_{(\hat{\omega}, \infty)}(\omega). \]

Make your answer compact, and use conventional notation to the extent you can.
(For continuation, see Exercise 2.6.2.)


Exercise 2.4.7 Decisionmaking under Uncertainty In deciding whether to
permit a merger of two large firms selling the same product, a government regulatory
body (GRB) considers the change $\omega$ in the price of the product that will occur
following the merger. The GRB can take the action $a = 1$ (permit the merger) or
$a = 0$ (forbid the merger). Its loss function is $L(a, \omega)$ with $L(0, \omega) = 0$. The GRB
is uncertain about $\omega$, but given all available information, $\omega \sim N(\bar{\omega}, h_\omega^{-1})$.

(a) Suppose that the GRB's loss function is $L(1, \omega) = (\omega - \omega^*) + g(\omega - \omega^*)^2$.
In this loss function, $\omega^* > 0$ and $g > 0$. What does the loss function express
about the GRB's attitude toward price changes following the merger?

(b) Continuing to assume the same loss function as in (a), express the GRB's
decision rule in terms of $\bar{\omega}$, $h_\omega$, $\omega^*$, and $g$.

(c) Now suppose instead that the GRB's loss function is $L(1, \omega) = \exp(f\omega) - b$.
In this loss function $f > 0$ and $b > 0$. Does this loss function express
attitudes toward price change similar to those for the loss function in (a)?

(d) Continuing to assume the loss function in (c), express the GRB's decision
rule in terms of $\bar{\omega}$, $h_\omega$, $f$, and $b$.

(e) Suppose that the GRB's staff reports $\bar{\omega}$ but not $h_\omega$. Would this matter in
part (b)? In part (d)?


2.5 CREDIBLE SETS 


A credible set conveys one aspect of uncertainty about a vector of interest $\omega \in \Omega$
with pdf $p(\omega)$. It is a mapping from the distribution of $\omega$ to a subset of $\Omega$ containing
$\omega$ with given probability.


Definition 2.5.1 A set $C \subseteq \Omega$ such that

\[ P(\omega \in C) = \int_C p(\omega)\, d\nu(\omega) = 1 - \alpha \tag{2.71} \]

is a $100(1 - \alpha)\%$ credible set for $\omega$ with respect to $p(\omega)$.


If the distribution of $\omega$ is absolutely continuous, so that $d\nu(\omega) = d\omega$, then $C$
must exist, and except possibly in the case $\alpha = 0$, $C$ will not be unique. On the
other hand, if $\omega$ is a discrete random variable then $C$ will exist only for certain
values of $\alpha$. In what follows in this section, for any $S \subseteq \Omega$, $\bar{S} = \Omega - S$, and
$\nu(S) = \int_S d\nu(\omega)$.

A posterior credible set is defined using (2.71) and $p(\omega) = p(\omega \mid y^o, A)$. It
differs fundamentally from a non-Bayesian confidence region. The latter is a set
$R(y) \subseteq \Omega$ such that

\[ P[\omega \in R(y) \mid \theta_A, A] = \int_Y \int_{R(y)} p(\omega \mid y, \theta_A, A)\, d\nu(\omega)\; p(y \mid \theta_A, A)\, d\nu(y) = 1 - \alpha. \]

With the exception of a few elementary cases, it is generally impossible to find an
expression for $P[\omega \in R(y) \mid \theta_A, A]$ that does not involve the unobservables $\theta_A$.
The problem is essentially the same as that arising for non-Bayesian point estimates.
Even when these problems can be solved, non-Bayesian confidence regions easily
can lead to awkward results.
Example 2.5.1 A Simple Confidence Interval Suppose that $\{y_t\}_{t=1}^T$ are
independently and uniformly distributed on the interval $[\theta - .5, \theta + .5]$, and $T = 25$.
A 95% confidence interval for $\theta$ is $[\hat{\theta} - .056, \hat{\theta} + .056]$, where

\[ \hat{\theta} = [\min(y_t) + \max(y_t)]/2. \]

Consider case 1:

    min(y_t) = 3.10, max(y_t) = 3.20, R = [3.094, 3.206],

and case 2:

    min(y_t) = 3.00, max(y_t) = 3.96, R = [3.424, 3.536].

Both results defy common sense. In case 1, $\theta$ could be as small as 2.7 and as
large as 3.6, and in case 2, it is certain that $\theta$ must be in the interval [3.46, 3.50].
In both cases the difficulty is that it is intuitive to condition on the data actually
observed, whereas this non-Bayesian procedure provides intervals that include $\theta$
with probability .95 in hypothetical repetitions of the experiment of collecting
samples of size $T = 25$.

To construct credible sets, introduce the prior distribution $\theta \sim N(\underline{\theta}, \underline{h}^{-1})$. Then
100% credible sets are [2.7, 3.6] in case 1 and [3.46, 3.50] in case 2. As $\underline{h} \to 0$,
95% credible sets are [2.7225, 3.5775] in case 1, and [3.461, 3.499] in case 2. In
general, credible sets are not hard to determine for any given values of $\underline{\theta}$ and $\underline{h}$.
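
The numbers in Example 2.5.1 can be checked with a short script. This is a sketch under stated assumptions: it treats the diffuse-prior limit as a posterior that is uniform on the set of $\theta$ values consistent with the data, uses the central credible interval, and relies on the standard result that the midrange of $T$ uniform draws with unit range misses $\theta$ by more than $c$ with probability $(1 - 2c)^T$. The function names are hypothetical.

```python
def confidence_interval(y_min, y_max, half_width=0.056):
    """Non-Bayesian interval centred on the midrange (half-width for T = 25)."""
    mid = 0.5 * (y_min + y_max)
    return mid - half_width, mid + half_width

def feasible_interval(y_min, y_max):
    """Values of theta consistent with every y_t lying in [theta - .5, theta + .5]."""
    return y_max - 0.5, y_min + 0.5

def flat_prior_credible_interval(y_min, y_max, alpha=0.05):
    """Central 100(1 - alpha)% credible interval when the posterior is uniform
    on the feasible interval (the limiting case of a diffuse prior)."""
    lo, hi = feasible_interval(y_min, y_max)
    trim = 0.5 * alpha * (hi - lo)
    return lo + trim, hi - trim

def coverage_of_midrange_interval(half_width, T):
    """P(|midrange - theta| <= half_width) for T uniform draws with unit range."""
    return 1.0 - (1.0 - 2.0 * half_width) ** T

for y_min, y_max in [(3.10, 3.20), (3.00, 3.96)]:
    print(confidence_interval(y_min, y_max),
          feasible_interval(y_min, y_max),
          flat_prior_credible_interval(y_min, y_max))
print(coverage_of_midrange_interval(0.056, 25))
```

In case 2 the confidence interval extends well outside the feasible interval, which is the awkward result the example describes.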


Credible sets are invariant under transformation. If

\[ P(\omega \in C) = \int_C p(\omega)\, d\nu(\omega) = 1 - \alpha, \]

$f : \Omega \to \Omega^*$ is one-to-one, $\omega^* = f(\omega)$, and $C^*$ is the image of $C$ under $f$, then
$P(\omega^* \in C^*) = 1 - \alpha$. Clearly credible sets are not unique, in general. For example,
if the posterior distribution of $\omega$ is absolutely continuous and $\alpha > 0$, then there are
uncountably many solutions of (2.71) for $C$. Nor, in general, need a credible set
of a specified size exist if the posterior distribution is not absolutely continuous.
An interesting subset of all credible sets is the set of highest probability density
(HPD) credible sets.


Definition 2.5.2 A $100(1 - \alpha)\%$ highest probability density (HPD) credible set
for $\omega$ with respect to $p(\omega)$ is a $100(1 - \alpha)\%$ credible set $C$ for $\omega$ with the property
that $p(\omega_1) \ge p(\omega_2)$ for all $\omega_1 \in C$ and all $\omega_2 \in \bar{C}$.


The elements of the set of HPD credible sets are often unique up to sets of
$\nu$-measure 0. This is always the case if there exists a function $c : (0, 1) \to \mathbb{R}^+$ with
the property that $P[\omega : p(\omega) \ge c(\alpha)] = 1 - \alpha$. For example, HPD credible sets
for multivariate normal distributions are unique and consist of ellipses and their
interiors, like those shown in Figure 2.1. On the other hand, HPD credible sets for
a uniform distribution are not unique.

It is natural to cast the choice of a particular credible set from among all possible 
credible sets as a Bayesian decision problem. The HPD credible sets correspond 
to the set of solutions of one such problem. 


Theorem 2.5.1 Optimality of Highest-Density Regions Suppose that the
distribution of $\omega$ is absolutely continuous and $p(\omega)$ is the pdf of $\omega$. For all $C \subseteq \Omega$,
let $\nu(C)$ be the Lebesgue (ordinary) measure of $C$. Let $\mathcal{C}^*(\alpha) = \{C : P(\omega \in C) =
1 - \alpha\}$. Given the loss function $L(C, \omega) = k\nu(C) - I_C(\omega)$, with $k > 0$ and defined
on $[\mathcal{C}^*(\alpha) \times \Omega] \to \mathbb{R}$, $C$ is a solution of

\[ \min_C E[L(C, \omega)] \tag{2.72} \]

if and only if for all $\omega_1 \in C$ and $\omega_2 \in \bar{C}$, $p(\omega_1) \ge p(\omega_2)$, except possibly for a
collection of $(\omega_1, \omega_2)$ with probability zero.


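Proof: For any $C \in \mathcal{C}^*(\alpha)$, $E[L(C, \omega)] = k\nu(C) - (1 - \alpha)$, and consequently
the solutions of (2.72) and the problem $\min_{C \in \mathcal{C}^*(\alpha)} \nu(C)$ are the same.

Suppose that for all $\omega_1 \in C$ and $\omega_2 \in \bar{C}$, $p(\omega_1) \ge p(\omega_2)$, except possibly for
a collection of $(\omega_1, \omega_2)$ with posterior probability zero. For any other $D \in \mathcal{C}^*(\alpha)$,
we obtain

\[ P(\omega \in C) = P(\omega \in D) \;\Longrightarrow\; P(\omega \in C \cap \bar{D}) = P(\omega \in \bar{C} \cap D), \]

and consequently

\[ \inf_{\omega \in C \cap \bar{D}} p(\omega)\, \nu(C \cap \bar{D}) \le \int_{C \cap \bar{D}} p(\omega)\, d\nu(\omega)
   = \int_{\bar{C} \cap D} p(\omega)\, d\nu(\omega) \le \sup_{\omega \in \bar{C} \cap D} p(\omega)\, \nu(\bar{C} \cap D). \]

By assumption $\sup_{\omega \in \bar{C} \cap D} p(\omega) \le \inf_{\omega \in C \cap \bar{D}} p(\omega)$, and so $\nu(C \cap \bar{D}) \le \nu(\bar{C} \cap D)$,
whence $\nu(C) \le \nu(D)$.

Now suppose the contrary: that there exist $E \subseteq C$ and $B \subseteq \bar{C}$ such that
$\omega_1 \in E$, $\omega_2 \in B \Rightarrow p(\omega_2) > p(\omega_1)$, and $P(\omega \in E) = P(\omega \in B) > 0$. Let $D =
(C \cap \bar{E}) \cup B$. Then $D \in \mathcal{C}^*(\alpha)$, and by the argument in the previous paragraph
$\nu(D) < \nu(C)$. ∎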
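
Theorem 2.5.1 suggests a simple simulation approximation for a scalar $\omega$ with a unimodal density: among all intervals that contain a fraction $1 - \alpha$ of a sample of draws, take the shortest. The sketch below is an illustration, not a method described in this section. The posterior used (draws of $h$ with $\bar{s}^2 h \sim \chi^2(\bar{\nu})$ and $\bar{s}^2 = \bar{\nu} = 5$) and all function names are assumptions.

```python
import random

def shortest_interval(draws, alpha):
    """Shortest interval spanning a fraction 1 - alpha of the sorted draws.

    For a unimodal density this approximates the HPD interval, which
    Theorem 2.5.1 characterizes as the credible set of minimum length.
    """
    x = sorted(draws)
    n = len(x)
    m = int((1.0 - alpha) * n)          # index span the interval must cover
    best = min(range(n - m), key=lambda i: x[i + m] - x[i])
    return x[best], x[best + m]

def equal_tail_interval(draws, alpha):
    """Interval with the same index span, centred in the sorted sample."""
    x = sorted(draws)
    n = len(x)
    m = int((1.0 - alpha) * n)
    lo = (n - m) // 2
    return x[lo], x[lo + m]

rng = random.Random(0)
nu_bar, s2_bar = 5, 5.0                 # assumed posterior parameters
# A chi-square(nu) variate is a gamma variate with shape nu/2 and scale 2.
draws = [rng.gammavariate(nu_bar / 2.0, 2.0) / s2_bar for _ in range(50_000)]
hpd = shortest_interval(draws, 0.10)
eti = equal_tail_interval(draws, 0.10)
print("approximate 90% HPD interval:", hpd)
print("equal-tail 90% interval:     ", eti)
```

Because the density of $h$ is skewed to the right, the shortest interval sits to the left of the equal-tail interval and is narrower.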


HPD credible sets are not invariant under transformation, as illustrated
in Figure 2.2. Figure 2.2a illustrates $p(\omega)$, with the solid line indicating the
unique 80% HPD interval. The transformation $\omega^* = f(\omega) = \omega^{1/2}$, illustrated in
Figure 2.2b, leads to the pdf for $\omega^*$ shown in Figure 2.2d. The solid line in that
figure indicates the image of the HPD interval from Figure 2.2a. Clearly this is not
the 80% HPD interval for $\omega^*$, which is indicated by the solid line in Figure 2.2c.

Figure 2.2. Some credible sets and projections under transformation: (a) HPD region for $\omega$; (b) mapping
into $\omega^*$; (c) HPD region for $\omega^*$; (d) $\omega$ HPD region mapped into $\omega^*$.


More generally, if $f : \Omega \to \Omega^*$ is a one-to-one function, then $p(\omega^*) =
p[f^{-1}(\omega^*)]\, J[f^{-1}(\omega^*)]$ where $J(\omega) = |\partial f/\partial \omega|^{-1}$ is the Jacobian of
transformation. If $p(\omega_1) > p(\omega_2)$ but $p(\omega_1) J(\omega_1) < p(\omega_2) J(\omega_2)$, then $p(\omega_1^*) < p(\omega_2^*)$ for
$\omega_1^* = f(\omega_1)$, $\omega_2^* = f(\omega_2)$.


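Example 2.5.2 Highest Conditional Posterior Density Regions in the Normal
Linear Regression Model In the model (2.10)-(2.12) the conditional posterior
distribution of $\beta$ is (2.23). From the corresponding density kernel (2.22), a highest
posterior density region is of the form

\[ \{\beta : (\beta - \bar{\beta})' \bar{H} (\beta - \bar{\beta}) \le c\}. \]

Because $(\beta - \bar{\beta})' \bar{H} (\beta - \bar{\beta}) \mid (h, y^o, X, A) \sim \chi^2(k)$, a $100(1 - \alpha)\%$ highest
conditional posterior density region for $\beta$ is

\[ C = \{\beta : (\beta - \bar{\beta})' \bar{H} (\beta - \bar{\beta}) \le \chi^2_{1-\alpha}(k)\}, \]

where $\chi^2_{1-\alpha}(k)$ denotes the $(1 - \alpha)$ quantile of the $\chi^2(k)$ distribution.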
In the same model, the conditional posterior distribution of $h$ is (2.25) and
the corresponding conditional posterior density kernel (2.24) is unimodal. If $\bar{\nu} =
T + \underline{\nu} > 2$, then $p(h \mid \beta, y^o, X, A)$ is monotone increasing for $h < (\bar{\nu} - 2)/\bar{s}^2$ and
monotone decreasing for $h > (\bar{\nu} - 2)/\bar{s}^2$. Moreover, $\lim_{h \to 0} p(h \mid \beta, y^o, X, A) =
\lim_{h \to \infty} p(h \mid \beta, y^o, X, A) = 0$. Therefore a highest conditional posterior density
region is of the form $(c_1, c_2)$, with

\[ p(h = c_1 \mid \beta, y^o, X, A) = p(h = c_2 \mid \beta, y^o, X, A) \]

and

\[ \int_{c_1}^{c_2} p(h \mid \beta, y^o, X, A)\, dh = 1 - \alpha. \]

If $T + \underline{\nu} \le 2$, then $p(h \mid \beta, y^o, X, A)$ is monotone decreasing and the region is of
the form $(0, c)$ with $c = \chi^2_{1-\alpha}(T + \underline{\nu})/\bar{s}^2$, the $(1 - \alpha)$ quantile of the
$\chi^2(T + \underline{\nu})$ distribution divided by $\bar{s}^2$. A similar analysis may be applied to obtain
highest conditional posterior density regions for $\sigma^2$, beginning from (2.67), and for
$\sigma$, beginning from (2.68).


Exercise 2.5.1 A Complete Uniform Distribution Model (This is a continuation
of Exercise 2.3.2.) The observables $y_1, \ldots, y_T$ are independently distributed,
each with a uniform distribution on the interval $[0, \theta]$. The posterior distribution is
that corresponding to the conjugate prior distribution, found in Exercise 2.3.2(b).

(a) Find the 90% highest posterior density interval for $\theta$.
(b) Find the Bayes estimate $\hat{\theta}$ of $\theta$ given, alternatively
    (i) A quadratic loss function
    (ii) A linear-linear loss function
         $L(\hat{\theta}, \theta) = (1 - q)(\hat{\theta} - \theta) I_{(-\infty, \hat{\theta})}(\theta) + q(\theta - \hat{\theta}) I_{(\hat{\theta}, \infty)}(\theta)$;
    (iii) A zero-one loss function $L(\hat{\theta}, \theta; \varepsilon) = 1 - I_{N_\varepsilon(\hat{\theta})}(\theta)$.


Exercise 2.5.2 Credible Sets under Transformation In a complete model
suppose that the likelihood function is $L(\theta_A; y^o, A)$ and the prior density is $p(\theta_A \mid A)$.
Let $\gamma_A = f(\theta_A)$, where $f$ is one-to-one. Let $L^*(\gamma_A; y^o, A)$ be the corresponding
likelihood function for $\gamma_A$, and let $p^*(\gamma_A \mid A)$ be the corresponding prior
density for $\gamma_A$. Provide either a proof or a specific counterexample for each of the
following statements:

(a) If $\hat{\theta}_A$ is the mode of $L(\theta_A; y^o, A)$, then $\hat{\gamma}_A = f(\hat{\theta}_A)$ is the mode of
$L^*(\gamma_A; y^o, A)$.

(b) If $\hat{\theta}_A$ is the mode of $p(\theta_A \mid y^o, A)$, then $\hat{\gamma}_A = f(\hat{\theta}_A)$ is the mode of
$p^*(\gamma_A \mid y^o, A)$.

(c) If $R$ is a 95% highest posterior density region for $\theta_A$, and $S$ is the image of
$R$ under $f$, then $S$ is a 95% highest posterior density region for $\gamma_A$.


Exercise 2.5.3 Point Estimates and Credible Sets In a complete model
$y_t \mid (\mu, A) \overset{iid}{\sim} N(\mu, 1)$ $(t = 1, \ldots, T)$ and the prior distribution is $\mu \sim N(0, 4)$
truncated to $\mu > 0$: $p(\mu \mid A) = (2\pi)^{-1/2} \exp(-\mu^2/8) I_{(0, \infty)}(\mu)$. In answering each of
the following questions, express your answer in terms of the sample size $T$, the
observed sample mean $\bar{y}^o$, and the pdf and cdf $\phi$ and $\Phi$ (respectively) of the
standard normal distribution.

(a) What is the Bayes estimate of $\mu$ given
    (i) A quadratic loss function?
    (ii) A zero-one loss function?
    (iii) A linear-linear loss function with $q = .75$?
(b) What is a 90% highest posterior density region for $\mu$?

[Hints: (i) You may need to consider more than one case in each situation,
depending on the values of $\bar{y}^o$ and $T$; (ii) a standard result in distribution theory
states that if $x \sim N(0, 1)$ but is truncated to $x > c$, then $E(x) = \phi(c)/[1 - \Phi(c)]$.]


2.6 MODEL COMPARISON 


Often we must reach a conclusion or make a decision based on several models 
rather than one, and there is a large literature on model selection. The complete 
probability structure introduced in Section 1.5 suggests that model averaging, not 
model choice, is the essence of the problem. This insight goes back at least to 
Jeffreys (1939). Its importance in statistics was recognized by Roberts (1965) and in 
econometrics by Zellner (1971) and Leamer (1978). More recently Draper (1995), 
Chatfield (1995), Kass and Raftery (1995), and Hoeting et al. (1999) have reviewed 
theoretical and practical aspects of Bayesian model averaging. 

For specificity denote the models $j = 1, \ldots, J$. Model $j$ has unobservables
vector $\theta_{A_j}$, unobservables prior density $p(\theta_{A_j} \mid A_j)$, observables density
$p(y \mid \theta_{A_j}, A_j)$, and vector of interest density $p(\omega \mid y, \theta_{A_j}, A_j)$. The $J$ models are related
by their predictions for a common set of observables $y$ and a common vector
of interest $\omega$. The numbers of unobservables in the models may or may not
be the same and various models may or may not nest one another. The vector
of interest is substantively the same for all models, although its distribution
is model-specific. The specification of the collection of $J$ models is completed
with the prior probabilities $p(A_j)$ $(j = 1, \ldots, J)$ for the respective models, and
$\sum_{j=1}^{J} p(A_j) = 1$. There is no essential conceptual distinction between model and
prior: we could just as well regard the entire collection as a single model, with
$\{p(A_j), p(\theta_{A_j} \mid A_j)\, d\nu_j(\theta_{A_j})\}_{j=1}^{J}$ providing the prior distribution of unobservables.
At an operational level the distinction is usually clarified by the fact that we may
undertake the essential computations one model at a time.
2.6.1 Marginal Likelihoods


The density $p(\omega \mid y^o)$ is ultimately of interest. The formal solution is

\[ p(\omega \mid y^o) = \sum_{j=1}^{J} p(\omega \mid y^o, A_j)\, p(A_j \mid y^o), \tag{2.73} \]

known as model averaging. In expression (2.73), we obtain

\[ p(A_j \mid y^o) = p(y^o \mid A_j)\, p(A_j) / p(y^o) \propto p(y^o \mid A_j)\, p(A_j). \tag{2.74} \]

Thus the weight $p(A_j \mid y^o)$ in (2.73) is proportional to the product of the model prior
probability $p(A_j)$ and the marginal likelihood $p(y^o \mid A_j)$ of Definition 2.1.1. As noted there

\[ p(y^o \mid A_j) = \int_{\Theta_{A_j}} p(y^o \mid \theta_{A_j}, A_j)\, p(\theta_{A_j} \mid A_j)\, d\nu_j(\theta_{A_j}). \tag{2.75} \]

It is important to recognize that if $p(y^o \mid \theta_{A_j}, A_j)$ is replaced by the corresponding
likelihood function $L(\theta_{A_j}; y^o, A_j) \propto p(y^o \mid \theta_{A_j}, A_j)$ in (2.75), then, unless the
constants of proportionality are the same across all models, (2.73) will no longer
be true. Ignoring this fact can be extremely misleading because omitted constants
in likelihood functions can vary by many orders of magnitude from one model to
the next with the same data set.

The development in (2.73)-(2.75) indicates that the marginal likelihood
$p(y^o \mid A_j)$ is the key additional component, beyond the analysis of the individual models
$A_j$, that model averaging requires.
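
The weights in (2.73) and (2.74) are straightforward to compute once each model's log marginal likelihood is available. The following is a minimal sketch with hypothetical inputs: the three log marginal likelihoods, the equal prior probabilities, and the model-specific posterior means are made-up numbers, and the function names are not from the text. Averaging posterior means uses the fact that (2.73) is linear in the model-specific densities.

```python
import math

def posterior_model_probabilities(log_marginal_likelihoods, prior_probs):
    """Posterior model probabilities, proportional to p(y | A_j) p(A_j) as in (2.74).

    Works in logs and subtracts the maximum before exponentiating, since
    marginal likelihoods can differ by many orders of magnitude.
    """
    log_w = [lml + math.log(p)
             for lml, p in zip(log_marginal_likelihoods, prior_probs)]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def model_average(values_by_model, post_probs):
    """Average a model-specific posterior expectation using the weights in (2.73)."""
    return sum(v * p for v, p in zip(values_by_model, post_probs))

# Hypothetical inputs for three models (illustrative numbers only).
log_ml = [-1203.4, -1201.1, -1210.8]
prior = [1 / 3, 1 / 3, 1 / 3]
post = posterior_model_probabilities(log_ml, prior)
print([round(p, 4) for p in post])
print(model_average([0.8, 1.1, 0.5], post))
```

Note that the inputs must be logs of properly normalized marginal likelihoods; dropping model-specific constants would distort the weights, as the paragraph following (2.75) warns.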


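Example 2.6.1 Marginal Likelihood in the Normal Linear Regression Model
with Fixed Precision If $h = h_0$ and $\beta \mid A \sim N(\underline{\beta}, \underline{H}^{-1})$, then

\[ p(\beta \mid A)\, p(y^o \mid \beta, h_0, X, A) = (2\pi)^{-(T+k)/2}\, h_0^{T/2}\, |\underline{H}|^{1/2}
   \cdot \exp\{-[h_0 (y^o - X\beta)'(y^o - X\beta) + (\beta - \underline{\beta})' \underline{H} (\beta - \underline{\beta})]/2\}. \tag{2.76} \]

Completing the square as in Example 2.1.2, the term in brackets is

\[ (\beta - \bar{\beta})' \bar{H} (\beta - \bar{\beta}) + Q, \tag{2.77} \]

with $\bar{H} = \underline{H} + h_0 X'X$, $\bar{\beta} = \bar{H}^{-1}(\underline{H}\,\underline{\beta} + h_0 X'X b)$, and

\[ Q = h_0\, y^{o\prime} y^o + \underline{\beta}' \underline{H}\, \underline{\beta} - \bar{\beta}' \bar{H} \bar{\beta} \tag{2.78} \]
\[ \;\;\; = h_0 s^2 + (b - \bar{\beta})' h_0 X'X (b - \bar{\beta}) + (\bar{\beta} - \underline{\beta})' \underline{H} (\bar{\beta} - \underline{\beta}), \tag{2.79} \]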
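where $s^2 = (y^o - Xb)'(y^o - Xb)$. [Expression (2.79) follows from (2.78) in the same
way as (2.57) from (2.53).] Substituting (2.79) in (2.77) and then (2.77) in (2.76),
the marginal likelihood is

\[ \int p(\beta \mid A)\, p(y^o \mid \beta, h_0, X, A)\, d\beta
   = (2\pi)^{-(T+k)/2}\, h_0^{T/2}\, |\underline{H}|^{1/2} \int \exp\{-[(\beta - \bar{\beta})' \bar{H} (\beta - \bar{\beta}) + Q]/2\}\, d\beta \]
\[ = (2\pi)^{-T/2}\, h_0^{T/2}\, |\underline{H}|^{1/2}\, |\bar{H}|^{-1/2} \exp(-Q/2) \tag{2.80} \]
\[ = (2\pi)^{-T/2}\, h_0^{T/2}\, (|\underline{H}| / |\bar{H}|)^{1/2}
   \cdot \exp\{-[h_0 s^2 + (b - \bar{\beta})' h_0 X'X (b - \bar{\beta}) + (\bar{\beta} - \underline{\beta})' \underline{H} (\bar{\beta} - \underline{\beta})]/2\}. \]

This expression indicates those features of a model that contribute to a higher
marginal likelihood, and thus a greater weight in model averaging.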
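
A quick numerical check of the closed form (2.80), with $Q$ taken from (2.78), is possible in the single-regressor case $k = 1$, where the integral over $\beta$ is one-dimensional. The sketch below is illustrative only: the data, prior values, and function names are assumptions, and the integral is approximated by a Riemann sum on a fine grid.

```python
import math

def log_marginal_likelihood(y, x, h0, beta_prior, H_prior):
    """Log of (2.80) for a single regressor (k = 1), using Q from (2.78)."""
    T = len(y)
    xx = sum(xi * xi for xi in x)
    xy = sum(xi * yi for xi, yi in zip(x, y))
    yy = sum(yi * yi for yi in y)
    H_bar = H_prior + h0 * xx
    beta_bar = (H_prior * beta_prior + h0 * xy) / H_bar   # X'Xb = X'y
    Q = h0 * yy + H_prior * beta_prior ** 2 - H_bar * beta_bar ** 2
    return (-0.5 * T * math.log(2 * math.pi) + 0.5 * T * math.log(h0)
            + 0.5 * math.log(H_prior / H_bar) - 0.5 * Q)

def log_marginal_likelihood_numeric(y, x, h0, beta_prior, H_prior,
                                    n=20001, width=12.0):
    """Brute-force check: integrate prior times data density over a beta grid."""
    T = len(y)
    sd = H_prior ** -0.5
    lo = beta_prior - width * sd
    step = 2 * width * sd / (n - 1)
    total = 0.0
    for i in range(n):
        b = lo + i * step
        ssr = sum((yi - xi * b) ** 2 for xi, yi in zip(x, y))
        log_f = (-0.5 * (T + 1) * math.log(2 * math.pi)
                 + 0.5 * T * math.log(h0) + 0.5 * math.log(H_prior)
                 - 0.5 * (h0 * ssr + H_prior * (b - beta_prior) ** 2))
        total += math.exp(log_f) * step
    return math.log(total)

# Made-up data and prior settings, for illustration only.
y = [1.2, 0.7, 2.9, 3.1, 4.8]
x = [0.5, 1.0, 1.5, 2.0, 2.5]
closed = log_marginal_likelihood(y, x, h0=2.0, beta_prior=1.0, H_prior=0.5)
numeric = log_marginal_likelihood_numeric(y, x, h0=2.0, beta_prior=1.0, H_prior=0.5)
print(closed, numeric)
```

The grid covers twelve prior standard deviations on either side of the prior mean, which is ample here because the posterior for $\beta$ is concentrated well inside that range.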


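Example 2.6.2 Marginal Likelihood in the Normal Linear Regression Model
with Conjugate Prior Example 2.3.3 showed that the conjugate prior distribution
in the normal linear regression model (2.10) is given by (2.48) and (2.49). The
posterior density kernel in standard form is

\[ p(h \mid A)\, p(\beta \mid h, A)\, p(y^o \mid X, \beta, h, A)
   = [2^{\underline{\nu}/2}\, \Gamma(\underline{\nu}/2)]^{-1} (2\pi)^{-(T+k)/2} (\underline{s}^2)^{\underline{\nu}/2} |\underline{H}|^{1/2}\, h^{(T+k+\underline{\nu}-2)/2} \exp(-\underline{s}^2 h/2) \]
\[ \cdot \exp\{-h[(\beta - \underline{\beta})' \underline{H} (\beta - \underline{\beta}) + (y^o - X\beta)'(y^o - X\beta)]/2\}. \]

The last term in brackets can be expressed $(\beta - \bar{\beta})' \bar{H} (\beta - \bar{\beta}) + Q^*$. The posterior
parameters $\bar{\beta}$ and $\bar{H}$ are defined in (2.52) and $Q^*$ is given in (2.57). Then

\[ \int p(h \mid A)\, p(\beta \mid h, A)\, p(y^o \mid \beta, h, X, A)\, d\beta
   = [2^{\underline{\nu}/2}\, \Gamma(\underline{\nu}/2)]^{-1} (2\pi)^{-T/2} (\underline{s}^2)^{\underline{\nu}/2} (|\underline{H}| / |\bar{H}|)^{1/2} \]
\[ \cdot\; h^{(T+\underline{\nu}-2)/2} \exp[-(\underline{s}^2 + Q^*) h/2]. \tag{2.81} \]

The last line in this expression is a kernel of the density corresponding to the
distribution $(\underline{s}^2 + Q^*) h \sim \chi^2(T + \underline{\nu})$, from which [recall (2.15)]

\[ \int_0^\infty h^{(T+\underline{\nu}-2)/2} \exp[-(\underline{s}^2 + Q^*) h/2]\, dh \tag{2.82} \]
\[ = 2^{(T+\underline{\nu})/2}\, \Gamma[(T + \underline{\nu})/2]\, (\underline{s}^2 + Q^*)^{-(T+\underline{\nu})/2}. \tag{2.83} \]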
Substituting for $Q^*$ from (2.57) and then placing (2.83) in (2.81), the marginal
likelihood is

\[ \int\!\!\int p(h \mid A)\, p(\beta \mid h, A)\, p(y^o \mid \beta, h, X, A)\, dh\, d\beta
   = \pi^{-T/2} \{\Gamma[(T + \underline{\nu})/2] / \Gamma(\underline{\nu}/2)\} (|\underline{H}| / |\bar{H}|)^{1/2} (\underline{s}^2)^{\underline{\nu}/2} \]
\[ \cdot\; [\underline{s}^2 + s^2 + (b - \bar{\beta})' X'X (b - \bar{\beta}) + (\bar{\beta} - \underline{\beta})' \underline{H} (\bar{\beta} - \underline{\beta})]^{-(T+\underline{\nu})/2}. \tag{2.84} \]

From (2.74), the ratio of posterior probabilities of two models is

\[ \frac{P(A_j \mid y^o)}{P(A_k \mid y^o)} = \frac{P(A_j)}{P(A_k)} \cdot \frac{p(y^o \mid A_j)}{p(y^o \mid A_k)}. \tag{2.85} \]

This ratio is central in comparing models.


Definition 2.6.1 In favor of the model $A_j$ versus the model $A_k$, the prior odds
ratio is $P(A_j)/P(A_k)$; the Bayes factor is $p(y^o \mid A_j)/p(y^o \mid A_k)$; and the posterior
odds ratio is $P(A_j \mid y^o)/P(A_k \mid y^o)$.


In (2.85) the posterior odds ratio is expressed as the product of the prior odds
ratio and the Bayes factor. The Bayes factor, in turn, is the ratio of marginal
likelihoods. If a Bayesian investigator reports the marginal likelihood of a model,
then others can use this to make comparisons with other models and include the
model in the process of model averaging. In special cases there are analytical
expressions for Bayes factors, and these reveal those properties of models and data
that are important in the posterior odds ratio.


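Example 2.6.3 Bayes Factor for Two Normal Linear Regression Models with
Conjugate Priors Suppose that there are two models

\[ y \sim N(X_j \beta_j, h_j^{-1} I_T), \]
\[ \underline{s}_j^2 h_j \mid A_j \sim \chi^2(\underline{\nu}_j), \quad \beta_j \mid (h_j, A_j) \sim N(\underline{\beta}_j, h_j^{-1} \underline{H}_j^{-1}) \quad (j = 1, 2). \]

Note that the vector $y$ is the same for the two models. They may therefore be
used in model averaging, and can be regarded as competing specifications for the
observable $y$. From (2.84) and Definition 2.6.1 the Bayes factor in favor of model
$A_1$ versus model $A_2$ is

\[ \frac{p(y^o \mid X_1, A_1)}{p(y^o \mid X_2, A_2)}
   = \frac{\Gamma[(T + \underline{\nu}_1)/2] / \Gamma(\underline{\nu}_1/2)}{\Gamma[(T + \underline{\nu}_2)/2] / \Gamma(\underline{\nu}_2/2)}
   \cdot \frac{(|\underline{H}_1| / |\bar{H}_1|)^{1/2} (\underline{s}_1^2)^{\underline{\nu}_1/2}}{(|\underline{H}_2| / |\bar{H}_2|)^{1/2} (\underline{s}_2^2)^{\underline{\nu}_2/2}} \]
\[ \cdot\; \frac{[\underline{s}_1^2 + s_1^2 + (b_1 - \bar{\beta}_1)' X_1'X_1 (b_1 - \bar{\beta}_1) + (\bar{\beta}_1 - \underline{\beta}_1)' \underline{H}_1 (\bar{\beta}_1 - \underline{\beta}_1)]^{-(T+\underline{\nu}_1)/2}}
               {[\underline{s}_2^2 + s_2^2 + (b_2 - \bar{\beta}_2)' X_2'X_2 (b_2 - \bar{\beta}_2) + (\bar{\beta}_2 - \underline{\beta}_2)' \underline{H}_2 (\bar{\beta}_2 - \underline{\beta}_2)]^{-(T+\underline{\nu}_2)/2}}. \tag{2.86} \]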
In the same situation the likelihood ratio statistic is $(s_1^2/s_2^2)^{-T/2}$. All of the
modifications of this statistic in (2.86) come about because of the prior distributions
in the two models. Conventional non-Bayesian hypothesis tests comparing the two
models are based on the fact that if one of the models nests the other (that is, if
either $X_1 = X_2 A_{21}$ or $X_2 = X_1 A_{12}$) then $(s_1^2/s_2^2)^{-T/2}$ has a convenient sampling
distribution. There is no such nesting required for the use of Bayes factors and
posterior odds ratios. Furthermore, while $X_1 = X_2 A_{21}$ implies that the likelihood
ratio $(s_1^2/s_2^2)^{-T/2}$ cannot exceed 1, the Bayes factor can be any positive number.

Inspection of (2.86) indicates those aspects of model and observed data that lead
to a higher Bayes factor in favor of the first model. If $\underline{\nu}_1 = \underline{\nu}_2 = \underline{\nu}$, $\underline{s}_1^2 = \underline{s}_2^2 = \underline{s}^2$,
$\underline{\beta}_1 = b_1$ and $\underline{\beta}_2 = b_2$, then the second line of (2.86) reduces to
$[(\underline{s}^2 + s_1^2)/(\underline{s}^2 + s_2^2)]^{-(T+\underline{\nu})/2}$. If $\underline{s}^2$ is small relative to $s_1^2$ and $s_2^2$, and $\underline{\nu}$ is small relative to $T$, this
is a minor modification of the likelihood ratio. The first line of (2.86) favors the
model in which the prior precision is a relatively more important component of the
posterior precision of the coefficient vector. In a situation in which prior means
agreed with the least-squares fit, the model that concentrates its prior probability
more intensely on this point is favored in the first line.


In the model averaging process (2.73) it is the relative posterior probabilities,
or equivalently the posterior odds ratios, of the models under consideration that
matter. In general, there is no sense in which we are forced to choose among
models. In some cases, however, choice of models is the essence of the decision
problem; see, for example, Bajari and Lee (2003), who use alternative models and
Bayes factors in deciding whether there has been collusion at an auction. With no
real loss of generality, assume that there are only two models in the choice set.
Treating model choice as a Bayes action, let $L(A_i, A_j)$ denote the loss incurred
in choosing model $A_i$ conditional on model $A_j$ being true. Suppose further that
$L(A_i, A_i) = 0$ and $L(A_i, A_j) > 0$ if $i \ne j$. Given the data $y^o$, the expected loss
from choosing model $A_i$ is $P(A_j \mid y^o) L(A_i, A_j)$ $(j \ne i)$, and so the Bayes action
is to choose model 1 if $P(A_2 \mid y^o) L(A_1, A_2) < P(A_1 \mid y^o) L(A_2, A_1)$, that is, if

\[ \frac{P(A_1 \mid y^o)}{P(A_2 \mid y^o)} = \frac{P(A_1)\, p(y^o \mid A_1)}{P(A_2)\, p(y^o \mid A_2)} > \frac{L(A_1, A_2)}{L(A_2, A_1)}. \]


Definition 2.6.2 In choosing between two models, the ratio of loss functions
$L(A_1, A_2)/L(A_2, A_1)$ is the Bayes critical value.


We choose model 1 if the posterior odds ratio in favor of it exceeds the Bayes
critical value. For reasons of economy an investigator may therefore report only the
marginal likelihood, leaving it to his or her clients (the users of the investigator's
research) to provide their own prior model probabilities and loss functions. The
steps of reporting marginal likelihoods and Bayes factors are sometimes called
hypothesis testing as well.
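
The decision rule built on the Bayes critical value reduces to a comparison of two numbers. A minimal sketch follows; the function name, the log marginal likelihoods, the prior probabilities, and the losses are all hypothetical, and the comparison is done in logs to avoid overflow when marginal likelihoods are extreme.

```python
import math

def choose_model(log_ml_1, log_ml_2, prior_1, prior_2, loss_12, loss_21):
    """Return 1 or 2, the model with the smaller posterior expected loss.

    loss_12 = L(A_1, A_2): loss from choosing model 1 when model 2 is true.
    loss_21 = L(A_2, A_1): loss from choosing model 2 when model 1 is true.
    """
    log_bayes_factor = log_ml_1 - log_ml_2
    log_posterior_odds = math.log(prior_1 / prior_2) + log_bayes_factor
    log_critical_value = math.log(loss_12 / loss_21)   # Bayes critical value
    return 1 if log_posterior_odds > log_critical_value else 2

# Illustrative numbers: the data favour model 1 by a Bayes factor of exp(2.3).
print(choose_model(-1201.1, -1203.4, 0.5, 0.5, loss_12=1.0, loss_21=1.0))
# If wrongly choosing model 1 is twenty times as costly, the choice reverses.
print(choose_model(-1201.1, -1203.4, 0.5, 0.5, loss_12=20.0, loss_21=1.0))
```

The example shows why an investigator can report only marginal likelihoods: the same evidence leads to different actions for clients with different loss functions.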
2.6.2 Predictive Densities


The marginal likelihood of model $A_j$, $p(Y_T^o \mid A_j)$, is the measure of how well
model $A_j$ predicted the data $Y_T^o$ that are relevant for the comparison of model
$j$ with other models. In fact, there is a more formal link between the marginal
likelihood of a model and the adequacy of the model's predictions that underscores
the predictive interpretation of $p(Y_T^o \mid A_j)$. To establish this link, first consider the
distribution of $y_{T+1}, \ldots, y_F$ conditional on $Y_T^o$ and model $A$.


Definition 2.6.3 The predictive density of $y_{T+1}, \ldots, y_F$ conditional on $Y_T^o$ and
model $A$ is

\[ p(y_{T+1}, \ldots, y_F \mid Y_T^o, A). \tag{2.87} \]


The predictive density is relevant after formulation of model $A$ and observing
$Y_T = Y_T^o$, but before observing $y_{T+1}, \ldots, y_F$. Once $y_{T+1}^o, \ldots, y_F^o$ are known, we
can evaluate (2.87) at the observed values.


Definition 2.6.4 The predictive likelihood of $y_{T+1}^o, \ldots, y_F^o$ conditional on $Y_T^o$
and the model $A$ is the real number $p(y_{T+1}^o, \ldots, y_F^o \mid Y_T^o, A)$.


It is natural to compare how well alternative models predict the same set of
observations, by comparing their predictive likelihoods.


Definition 2.6.5 The predictive Bayes factor in favor of model $A_j$, versus
model $A_k$, is $p(y_{T+1}^o, \ldots, y_F^o \mid Y_T^o, A_j) / p(y_{T+1}^o, \ldots, y_F^o \mid Y_T^o, A_k)$.


There is a formal link between predictive likelihood and marginal likelihood
that is illuminating and useful, dating at least to Geisel (1975).


Theorem 2.6.1 Representation of Predictive Likelihood The predictive
likelihood is a ratio of marginal likelihoods:

\[ p(y_{T+1}^o, \ldots, y_F^o \mid Y_T^o, A) = p(Y_F^o \mid A) / p(Y_T^o \mid A). \]

Proof:

\[ p(Y_F^o \mid A) = p(Y_F^o \mid Y_T^o, A)\, p(Y_T^o \mid A)
   = p(y_{T+1}^o, \ldots, y_F^o \mid Y_T^o, A)\, p(Y_T^o \mid A). \]
∎


Theorem 2.6.1 shows that the predictive likelihood is the multiplicative updating
factor applied to the marginal likelihood $p(Y_T^o \mid A)$, after the observations
$y_{T+1}^o, \ldots, y_F^o$ become available, that produces the new marginal likelihood
$p(Y_F^o \mid A)$. This updating relationship is quite general.
Corollary 2.6.1 Decomposition of the Marginal Likelihood For any strictly
increasing sequence of integers $\{s_j\}_{j=0}^{q}$ with $s_0 = 0$ and $s_q = T$, the marginal
likelihood may be decomposed

\[ p(Y_T^o \mid A) = \prod_{\tau=1}^{q} p(y_{s_{\tau-1}+1}^o, \ldots, y_{s_\tau}^o \mid Y_{s_{\tau-1}}^o, A). \tag{2.88} \]


This result immediately implies that the Bayes factor in favor of model $A_j$
versus model $A_k$ can be decomposed in terms of predictive Bayes factors:

\[ \frac{p(Y_T^o \mid A_j)}{p(Y_T^o \mid A_k)} = \prod_{\tau=1}^{q}
   \frac{p(y_{s_{\tau-1}+1}^o, \ldots, y_{s_\tau}^o \mid Y_{s_{\tau-1}}^o, A_j)}{p(y_{s_{\tau-1}+1}^o, \ldots, y_{s_\tau}^o \mid Y_{s_{\tau-1}}^o, A_k)}. \]


The decomposition in Corollary 2.6.1 summarizes the "out of sample prediction
record" of the model as expressed in the predictive likelihoods. In the sense made
precise by (2.88) and the use of $p(Y_T^o \mid A)$ in model averaging [(2.73) and (2.74)],
there is no distinction between a model's adequacy and its out of sample prediction
record. The decomposition (2.88) may be interpreted as a formal expression
of Milton Friedman's well-known identification of a model's evaluation with its
predictive performance: "Theory is to be judged by its predictive power ... The
only relevant test of the validity of a hypothesis is comparison of its predictions
with experience" [see Friedman (1953), pp. 8-9; emphasis in original]. There are
striking similarities between Friedman (1953) and Jeffreys (1939, 1961). The third
edition (Jeffreys 1961) contains, in Chapter 1, essentially the results presented here
for the very special case of deterministic dichotomous outcomes.
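
The decomposition (2.88) can be verified numerically in a simple conjugate setting. The sketch below assumes a normal model with known precision $h_0$ and a single regressor equal to one, so that $y_t \mid \mu \sim N(\mu, h_0^{-1})$ with prior $\mu \sim N(m, H^{-1})$; the data, parameter values, and function names are illustrative assumptions. It accumulates one-step-ahead log predictive likelihoods and compares the total with the closed-form log marginal likelihood of the type (2.80).

```python
import math

def log_normal_pdf(y, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)

def sequential_log_predictive(y, h0, m, H):
    """Sum of one-step-ahead log predictive likelihoods: (2.88) with s_tau = tau."""
    total = 0.0
    for yt in y:
        total += log_normal_pdf(yt, m, 1.0 / H + 1.0 / h0)  # p(y_t | Y_{t-1}, A)
        m = (H * m + h0 * yt) / (H + h0)                     # posterior mean update
        H = H + h0                                           # posterior precision update
    return total

def log_marginal_likelihood(y, h0, m, H):
    """Closed form of the type (2.80) with X a column of ones (k = 1)."""
    T = len(y)
    H_bar = H + h0 * T
    m_bar = (H * m + h0 * sum(y)) / H_bar
    Q = h0 * sum(v * v for v in y) + H * m * m - H_bar * m_bar * m_bar
    return 0.5 * (T * math.log(h0 / (2 * math.pi)) + math.log(H / H_bar) - Q)

# Made-up observations and prior settings, for illustration only.
y = [0.3, 1.9, 1.1, 0.4, 2.2, 1.6]
seq = sequential_log_predictive(y, h0=1.5, m=0.0, H=0.25)
closed = log_marginal_likelihood(y, h0=1.5, m=0.0, H=0.25)
print(seq, closed)
```

By Theorem 2.6.1, the difference between the log marginal likelihoods of the first six and the first four observations is the log predictive likelihood of observations five and six.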


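Example 2.6.4 Predictive Densities in the Normal Linear Regression Model
with Fixed Precision Suppose that the specification of the normal linear regression
model (2.10)-(2.11) applies to $F$ observations. Precision is fixed at $h = h_0$.
The covariate matrix $X$ and outcome vector $y^o$ for the first $T$ observations are
known. For the last $f = F - T$ observations the covariate matrix $\tilde{X}$ is known but
the corresponding outcome vector $\tilde{y}$ is not. Thus

\[ \beta \mid (y^o, X, \tilde{X}, A) \sim N(\bar{\beta}, \bar{H}^{-1}), \]

with $\bar{H} = \underline{H} + h_0 X'X$ and $\bar{\beta} = \bar{H}^{-1}(\underline{H}\,\underline{\beta} + h_0 X'y^o)$. Since

\[ \tilde{y} \mid (\beta, y^o, X, \tilde{X}, A) \sim N(\tilde{X}\beta, h_0^{-1} I_f) \]

it follows that

\[ \tilde{y} \mid (y^o, X, \tilde{X}, A) \sim N(\tilde{X}\bar{\beta},\; \tilde{X} \bar{H}^{-1} \tilde{X}' + h_0^{-1} I_f). \]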
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When Y = Y° is observed, the predictive likelihood for the last f observations is 
~ g _f2\on y —1 “1p 
p(y’ | y®, X, X, A) = 2r) P XH X +g Ty 
0) Yay vn Y = -10o _ YR 
expl- (§° — XB) (XH X'+ hg Ip)! — XB)/2]. 


(2.89) 
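From (2.78) and (2.80) the marginal likelihood for the first $T$ observations is

$$p(y^o \mid X, A) = (2\pi)^{-T/2} h_0^{T/2}\,|\underline H|^{1/2}\,|\bar H|^{-1/2}
\exp\bigl[-(h_0\,y^{o\prime}y^o + \underline\beta'\underline H\,\underline\beta - \bar\beta'\bar H\bar\beta)/2\bigr] \tag{2.90}$$

and the marginal likelihood for all $F$ observations $y_F^{o\prime} = (y^{o\prime}, \tilde y^{o\prime})$ is

$$p(y_F^o \mid X, \tilde X, A) = (2\pi)^{-F/2} h_0^{F/2}\,|\underline H|^{1/2}\,|\bar{\bar H}|^{-1/2}
\exp\bigl[-(h_0\,y_F^{o\prime}y_F^o + \underline\beta'\underline H\,\underline\beta - \bar{\bar\beta}'\bar{\bar H}\bar{\bar\beta})/2\bigr]. \tag{2.91}$$

In (2.91) $\bar{\bar H} = \bar H + h_0\tilde X'\tilde X$, and $\bar{\bar\beta} = \bar{\bar H}^{-1}(\underline H\,\underline\beta + h_0 X'y^o + h_0\tilde X'\tilde y^o)$.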
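Theorem 2.6.1 states that (2.89) is the ratio of (2.91) to (2.90). This follows
directly once we establish two facts:

$$h_0^{-f}\,|\bar H + h_0\tilde X'\tilde X| \,/\, |\bar H| = |\tilde X\bar H^{-1}\tilde X' + h_0^{-1} I_f| \tag{2.92}$$

and

$$(\tilde y^o - \tilde X\bar\beta)'(\tilde X\bar H^{-1}\tilde X' + h_0^{-1} I_f)^{-1}(\tilde y^o - \tilde X\bar\beta) = h_0\,\tilde y^{o\prime}\tilde y^o + \bar\beta'\bar H\bar\beta - \bar{\bar\beta}'\bar{\bar H}\bar{\bar\beta}. \tag{2.93}$$

Restate (2.92) as

$$|\bar H + h_0\tilde X'\tilde X| = |\bar H| \cdot |\tilde X\bar H^{-1}\tilde X' + h_0^{-1} I_f| \cdot h_0^{f},$$

and exploit the fact that if

$$E = \begin{bmatrix} A & C \\ B & D \end{bmatrix}$$

and $A$ and $D$ are nonsingular then

$$|E| = |A| \cdot |D - BA^{-1}C| = |D| \cdot |A - CD^{-1}B| \tag{2.94}$$

[Rao (1965), Problem 1b.2.4]. Letting

$$E = \begin{bmatrix} -\bar H & h_0\tilde X' \\ \tilde X & I_f \end{bmatrix},$$

from (2.94) we have

$$|-\bar H| \cdot |I_f + h_0\tilde X\bar H^{-1}\tilde X'| = |-\bar H - h_0\tilde X'\tilde X|
\;\Longrightarrow\; |\bar H + h_0\tilde X'\tilde X| = |\bar H| \cdot |I_f + h_0\tilde X\bar H^{-1}\tilde X'|
= |\bar H| \cdot |\tilde X\bar H^{-1}\tilde X' + h_0^{-1} I_f| \cdot h_0^{f}.$$
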


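Turning to (2.93), we obtain the equation

$$\bar{\bar\beta}'\bar{\bar H}\bar{\bar\beta} = \bar\beta'\bar H\bar\beta - \bar\beta'\tilde X'V\tilde X\bar\beta + 2\bar\beta'\tilde X'V\tilde y^o + h_0\,\tilde y^{o\prime}\tilde y^o - \tilde y^{o\prime}V\tilde y^o \tag{2.95}$$

where $V = (\tilde X\bar H^{-1}\tilde X' + h_0^{-1} I_f)^{-1}$. Note that

$$\bar{\bar H}^{-1} = (\bar H + h_0\tilde X'\tilde X)^{-1} = \bar H^{-1} - \bar H^{-1}\tilde X'V\tilde X\bar H^{-1}.$$

Hence

$$\begin{aligned}
\bar{\bar\beta} &= \bar{\bar H}^{-1}(\bar H\bar\beta + h_0\tilde X'\tilde y^o) \\
&= (\bar H^{-1} - \bar H^{-1}\tilde X'V\tilde X\bar H^{-1})(\bar H\bar\beta + h_0\tilde X'\tilde y^o) \\
&= \bar\beta + h_0\bar H^{-1}\tilde X'\tilde y^o - \bar H^{-1}\tilde X'V\tilde X\bar\beta - h_0\bar H^{-1}\tilde X'V\tilde X\bar H^{-1}\tilde X'\tilde y^o.
\end{aligned} \tag{2.96}$$

Note also that

$$\bar{\bar\beta}'\bar{\bar H} = \bar\beta'\bar H + h_0\,\tilde y^{o\prime}\tilde X. \tag{2.97}$$
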


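The left side of (2.95) is the product of (2.97) and (2.96). Expanding this product,
we have

$$\begin{aligned}
\bar{\bar\beta}'\bar{\bar H}\bar{\bar\beta} = {} & \bar\beta'\bar H\bar\beta && \text{(2.98a)} \\
& + h_0\,\bar\beta'\tilde X'\tilde y^o && \text{(2.98b)} \\
& - \bar\beta'\tilde X'V\tilde X\bar\beta && \text{(2.98c)} \\
& - h_0\,\bar\beta'\tilde X'V\tilde X\bar H^{-1}\tilde X'\tilde y^o && \text{(2.98d)} \\
& + h_0\,\tilde y^{o\prime}\tilde X\bar\beta && \text{(2.98e)} \\
& + h_0^2\,\tilde y^{o\prime}\tilde X\bar H^{-1}\tilde X'\tilde y^o && \text{(2.98f)} \\
& - h_0\,\tilde y^{o\prime}\tilde X\bar H^{-1}\tilde X'V\tilde X\bar\beta && \text{(2.98g)} \\
& - h_0^2\,\tilde y^{o\prime}\tilde X\bar H^{-1}\tilde X'V\tilde X\bar H^{-1}\tilde X'\tilde y^o. && \text{(2.98h)}
\end{aligned}$$
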
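Expressions (2.98a) and (2.98c) are the first two terms on the right side of (2.95).
Expression (2.98b) is the same as (2.98e), and (2.98d) is the same as (2.98g). Twice
the sum of (2.98b) and (2.98d) is

$$\begin{aligned}
2(h_0\,\bar\beta'\tilde X'\tilde y^o - h_0\,\bar\beta'\tilde X'V\tilde X\bar H^{-1}\tilde X'\tilde y^o)
&= 2h_0\,\bar\beta'\tilde X'(I_f - V\tilde X\bar H^{-1}\tilde X')\tilde y^o \\
&= 2h_0\,\bar\beta'\tilde X'[I_f - V(V^{-1} - h_0^{-1} I_f)]\tilde y^o \\
&= 2\bar\beta'\tilde X'V\tilde y^o.
\end{aligned}$$

This is the third term on the right side of (2.95). Finally, the sum of (2.98f) and
(2.98h) is

$$h_0^2\,\tilde y^{o\prime}(\tilde X\bar H^{-1}\tilde X' - \tilde X\bar H^{-1}\tilde X'V\tilde X\bar H^{-1}\tilde X')\tilde y^o.$$

Employing the relationships $V\tilde X\bar H^{-1}\tilde X' = \tilde X\bar H^{-1}\tilde X'V = I_f - h_0^{-1}V$, this expression is

$$\begin{aligned}
h_0^2\,\tilde y^{o\prime}[\tilde X\bar H^{-1}\tilde X' - \tilde X\bar H^{-1}\tilde X'(I_f - h_0^{-1}V)]\tilde y^o
&= h_0\,\tilde y^{o\prime}\tilde X\bar H^{-1}\tilde X'V\tilde y^o \\
&= h_0\,\tilde y^{o\prime}(I_f - h_0^{-1}V)\tilde y^o \\
&= h_0\,\tilde y^{o\prime}\tilde y^o - \tilde y^{o\prime}V\tilde y^o.
\end{aligned}$$

These are the last two terms on the right side of (2.95).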
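
The algebra above can be confirmed numerically. The sketch below is illustrative
only: it codes the log of the predictive likelihood (2.89) and of the marginal
likelihood (2.90), then compares (2.89) with the difference of the two log marginal
likelihoods. The function names, the simulated data, and the prior settings
(`beta0`, `H0`, standing in for the underlined prior mean and precision) are
assumptions made for this example.

```python
import numpy as np


def posterior(y, X, beta0, H0, h0):
    """Posterior precision and mean of beta when the precision is fixed at h0."""
    H_bar = H0 + h0 * X.T @ X
    beta_bar = np.linalg.solve(H_bar, H0 @ beta0 + h0 * X.T @ y)
    return H_bar, beta_bar


def log_marginal_likelihood(y, X, beta0, H0, h0):
    """Log of the marginal likelihood in the form of (2.90)."""
    T = y.size
    H_bar, beta_bar = posterior(y, X, beta0, H0, h0)
    quad = h0 * y @ y + beta0 @ H0 @ beta0 - beta_bar @ H_bar @ beta_bar
    logdet_prior = np.linalg.slogdet(H0)[1]
    logdet_post = np.linalg.slogdet(H_bar)[1]
    return 0.5 * (-T * np.log(2.0 * np.pi) + T * np.log(h0)
                  + logdet_prior - logdet_post - quad)


def log_predictive_likelihood(y_new, X_new, y, X, beta0, H0, h0):
    """Log of the predictive likelihood in the form of (2.89)."""
    f = y_new.size
    H_bar, beta_bar = posterior(y, X, beta0, H0, h0)
    cov = X_new @ np.linalg.solve(H_bar, X_new.T) + np.eye(f) / h0
    resid = y_new - X_new @ beta_bar
    logdet = np.linalg.slogdet(cov)[1]
    return -0.5 * (f * np.log(2.0 * np.pi) + logdet
                   + resid @ np.linalg.solve(cov, resid))


rng = np.random.default_rng(0)
T, f, k, h0 = 12, 5, 3, 2.0
X_all = rng.normal(size=(T + f, k))
y_all = X_all @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=T + f) / np.sqrt(h0)
beta0, H0 = np.zeros(k), np.eye(k)

log_predictive = log_predictive_likelihood(y_all[T:], X_all[T:], y_all[:T], X_all[:T],
                                           beta0, H0, h0)
log_ratio = (log_marginal_likelihood(y_all, X_all, beta0, H0, h0)
             - log_marginal_likelihood(y_all[:T], X_all[:T], beta0, H0, h0))
print(log_predictive, log_ratio)
```

Agreement of the two printed values is a check on (2.92) and (2.93) jointly, since
the first governs the determinant terms and the second the quadratic forms.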


Exercise 2.6.1 Comparison of Simple Normal Models Suppose $y_t \overset{iid}{\sim} N(\mu, 1)$.
The sample size is $T = 10$; $\bar y = -0.2$ and $\sum_{t=1}^{10} y_t^2/10 = 1$. For each of the
following prior distributions, compute the numerical value of the log marginal likelihood.
Explain the ordering of the values that you obtain.

(a) $\mu = 0$.
(b) $\mu \sim N(0, 1)$. (This is a special case of Example 2.6.1.)
(c) $\mu \sim N(0, 1)$ truncated to $\mu > 0$:

$$p(\mu) = (2/\pi)^{1/2}\exp(-\mu^2/2)\,I_{(0,\infty)}(\mu).$$

(This is a variant on Example 2.6.1.)


Exercise 2.6.2 Models for Positive Observables (This is a continuation of
Exercise 2.4.6.) The observables $y_1, \ldots, y_T$ are i.i.d. and strictly positive. In
model $A$

$$p(y_t \mid \theta, A) = \theta\exp(-\theta y_t)\,I_{(0,\infty)}(y_t)$$

while in model $B$

$$p(y_t \mid h, B) = (2/\pi)^{1/2}\,h^{1/2}\exp(-h y_t^2/2)\,I_{(0,\infty)}(y_t).$$

Derive the Bayes factor in favor of model $A$.
Exercise 2.6.3 Model Combination Suppose that there are three models (A,
B, C) for the observable $y$. Each model completely specifies the distribution of
$y$; there are no unobservables in any of the models. The complete specification is

Model    Model prior probability    y density     ω density

A        p(A)                       p(y | A)      p(ω | y, A)
B        p(B)                       p(y | B)      p(ω | y, B)
C        p(C)                       p(y | C)      p(ω | y, C)

(a) Suppose that the investigator's problem is to choose one of these models
subject to the following loss function:

Choice \ Truth    A          B          C

A                 0          L(A, B)    L(A, C)
B                 L(B, A)    0          L(B, C)
C                 L(C, A)    L(C, B)    0

Formulate an explicit rule for model choice. Be as specific as you can.

(b) Now suppose that the investigator's problem is to estimate ω, subject to a
quadratic loss function. What is her estimate? Be as specific as possible.

(c) Finally, suppose that the investigator's problem is to form an estimate $\hat\omega$,
using the loss function $|\hat\omega - \omega|$. What is her estimate? Be as specific as
possible.


CHAPTER 3 


Topics in Bayesian Inference 


This chapter continues the development of principles of Bayesian inference. While 
the topics treated here are not essential to the specific models taken up in 
Chapters 5-7, they provide a greater depth of understanding that often yields divi- 
dends in Bayesian investigation. Much of the chapter addresses the prior distribution 
of unobservables. The development of hierarchical priors (Section 3.1) illustrates 
how models can be enriched with large numbers of unobservables as long as prior 
information provides sufficient structure. The treatment of improper prior distri- 
butions (Section 3.2) emphasizes their interpretation as limits of proper priors. 
Section 3.3 provides one approach to the common situation in which the investi- 
gator does not know the client’s prior distribution or even the client. The chapter 
treats two other topics, as well. Asymptotic analysis (Section 3.4) derives conditions 
under which posterior distributions collapse to points as sample size increases, and 
further conditions that imply that the posterior distribution approaches the normal 
distribution. The chapter concludes with a discussion of the likelihood principle, 
which states that data-based information is conveyed entirely through the likelihood 
function. Bayesian inference is always consistent with the likelihood principle, and 
Section 3.5 illustrates how violations of this principle can lead to unreasonable 
decisions. 


3.1 HIERARCHICAL PRIORS AND LATENT VARIABLES 


The use of unobservable, or latent, variables has a history of more than a half- 
century in econometrics and the social sciences; see Goldberger (1974) or Gril- 
liches (1977) for a recounting of the origins of modeling with latent variables. 
In Bayesian statistics, the concept of the hierarchical prior distribution was intro- 
duced by Lindley and Smith (1972) and Smith (1973). In both cases the techniques 
have substantially increased the flexibility and applicability of inference and are 
widely used. While developed independently, the two principles are identical from 
a Bayesian perspective. Moreover there is a natural congruence between these 
methods and Markov chain Monte Carlo posterior simulation methods discussed
in Section 4.3. The following example illustrates the main ideas in a simple 
setting. 


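Example 3.1.1 Prior Distributions in a Model for Many Means Outcomes $y_{it}$
are observed for individuals $i = 1, \ldots, n$ and time periods (or trials) $t = 1, \ldots, T$.
Suppose that a complete model $A_1$ includes

$$y_{it} = \mu_i + \varepsilon_{it}, \qquad \varepsilon_{it} \overset{iid}{\sim} N(0, h^{-1}), \tag{3.1}$$

$$\boldsymbol\mu \mid A_1 \sim N(\underline{\boldsymbol\mu}, \underline H^{-1}), \tag{3.2}$$

where $\boldsymbol\mu' = (\mu_1, \ldots, \mu_n)$, and an independent prior $p(h \mid A_1)$ that need not be
further specified for the purposes of this example. The prior distribution of $\boldsymbol\mu$
incorporates the idea that there is substantial uncertainty about the means $\mu_i$, but
that relative to this uncertainty the means are likely to be similar, although not
identical. This idea could be expressed through $E(\boldsymbol\mu \mid A_1) = \underline{\boldsymbol\mu} = \iota_n\underline\mu$, where $\iota_n$ is an
$n \times 1$ vector of ones, $\mathrm{var}(\mu_i \mid A_1) = [\underline H^{-1}]_{ii} = \underline h^{-1}$, and $\mathrm{cov}(\mu_i, \mu_j \mid A_1) = [\underline H^{-1}]_{ij} = \rho\,\underline h^{-1}$.
The investigator provides the numerical values of $\underline\mu$, $\underline h > 0$, and $\rho \in (0, 1)$. The
closer $\rho$ is to one, the more similar are the means $\mu_i$ in the prior specification of
the model.

An alternative, but equivalent, complete model $A_2$ retains (3.1) but in place of
(3.2) introduces a hierarchical prior distribution. This distribution begins with a
new unobservable

$$u \mid A_2 \sim N(\underline\mu, \rho\,\underline h^{-1}). \tag{3.3}$$

Then the means $\mu_1, \ldots, \mu_n$ are conditionally independent, with

$$\mu_i \mid (u, A_2) \sim N[u, (1-\rho)\underline h^{-1}]. \tag{3.4}$$

Taken together, (3.3) and (3.4) are equivalent to (3.2): they imply
$\mathrm{var}(\mu_i) = \rho\,\underline h^{-1} + (1-\rho)\underline h^{-1} = \underline h^{-1}$ and $\mathrm{cov}(\mu_i, \mu_j) = \mathrm{var}(u) = \rho\,\underline h^{-1}$.
Yet a third complete model $A_3$ substitutes

$$y_{it} = Z_i + \varepsilon_{it}$$

for (3.1). The random variables $Z_i$ are latent; that is, they are never observed. If
$A_3$ specifies the distribution of latent variables

$$Z_i \mid (u, A_3) \overset{iid}{\sim} N[u, (1-\rho)\underline h^{-1}]$$

and the prior distribution $u \mid A_3 \sim N(\underline\mu, \rho\,\underline h^{-1})$, then $A_3$ is equivalent to $A_2$; in
fact, $Z_i = \mu_i$.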
More generally, a complete model $A$ with a two-tier hierarchical prior distribution
specifies a conditional observables density

$$p(y \mid \lambda_A, \psi_A, A). \tag{3.5}$$

In the first tier, the prior distribution of $\lambda_A$ is expressed conditional on a vector of
unobservable hyperparameters $\phi_A$:

$$p(\lambda_A \mid \phi_A, A). \tag{3.6}$$

The term "hyperparameter" denotes the fact that $\phi_A$ is not a parameter of the
observables density (3.5). Rather, it is a convenient construct for expressing uncertainty
about $\lambda_A$, by means of the prior density

$$p(\psi_A, \phi_A \mid A), \tag{3.7}$$

which completes the model. The prior density of all the unobservables is

$$p(\lambda_A, \psi_A, \phi_A \mid A) = p(\lambda_A \mid \phi_A, A)\,p(\psi_A, \phi_A \mid A). \tag{3.8}$$

A simple complete latent variable model $B$ includes a vector of unobserved
(latent) variables $Z$ and a conditional observables density

$$p(y \mid Z, \psi_B, B), \tag{3.9}$$

a model for the latent variables

$$p(Z \mid \phi_B, B), \tag{3.10}$$

and a prior density for the unobservables $\psi_B$ and $\phi_B$:

$$p(\psi_B, \phi_B \mid B). \tag{3.11}$$

Then the prior density of the full vector of unobservables $\theta_B' = (Z', \psi_B', \phi_B')$ is

$$p(Z, \psi_B, \phi_B \mid B) = p(Z \mid \phi_B, B)\,p(\psi_B, \phi_B \mid B). \tag{3.12}$$

Comparing (3.5)-(3.8) with (3.9)-(3.12), it is apparent that the complete model
with a two-tier hierarchical prior distribution and the simple latent variable model
with a conventional prior distribution are formally identical. Since

$$p(y \mid \phi_A, \psi_A, A) = \int p(\lambda_A \mid \phi_A, A)\,p(y \mid \lambda_A, \psi_A, A)\,d\nu(\lambda_A),$$

the unobservables vector $\lambda_A$ is formally redundant. The reason for introducing $\lambda_A$ is
that it facilitates expression of the prior distribution, makes analysis of the posterior
distribution easier, or both. Likewise the latent variable model implies

$$p(y \mid \phi_B, \psi_B, B) = \int p(Z \mid \phi_B, B)\,p(y \mid Z, \psi_B, B)\,d\nu(Z).$$

An advantage of the latent variable formulation in a Bayesian context is that it
obviates the need to evaluate the likelihood function

$$p(y^o \mid \phi_B, \psi_B, B) = \int_Z p(Z \mid \phi_B, B)\,p(y^o \mid Z, \psi_B, B)\,d\nu(Z)$$

analytically, a task that is impossible in some applications, for example, the
multinomial probit model [see Geweke et al. (1994) or McCulloch and Rossi (1994)].

A further advantage of hierarchical prior distributions is that they are often the 
natural vocabulary for generalizing a model, and they facilitate the expression of 
conditional posterior distributions that are central in the Markov chain Monte Carlo 
posterior simulators introduced in Section 4.3. 


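Example 3.1.2 Posterior Distributions in a Model for Many Means In the
complete model $A_2$ of Example 3.1.1 consisting of (3.1), (3.3), and (3.4), the
conditional posterior distributions of the means $\mu_i$ are independent:

$$\mu_i \mid (u, h, y^o, A_2) \sim N(\bar\mu_i, \bar h_i^{-1}) \quad (i = 1, \ldots, n) \tag{3.13}$$

with

$$\bar h_i = (1-\rho)^{-1}\underline h + Th, \qquad \bar\mu_i = \bar h_i^{-1}\Bigl[(1-\rho)^{-1}\underline h\,u + h\sum_{t=1}^{T} y_{it}^o\Bigr]. \tag{3.14}$$

Note that $\underline\mu$ does not appear in (3.14). The conditional posterior distribution of $u$ is

$$u \mid (\mu_1, \ldots, \mu_n, h, y^o, A_2) \sim N(\bar u, \bar h_u^{-1}) \tag{3.15}$$

with

$$\bar h_u = \rho^{-1}\underline h + n(1-\rho)^{-1}\underline h, \qquad \bar u = \bar h_u^{-1}\Bigl[\rho^{-1}\underline h\,\underline\mu + (1-\rho)^{-1}\underline h\sum_{i=1}^{n}\mu_i\Bigr]. \tag{3.16}$$

Note that the data $y^o$ do not appear in the latter distribution. The conditional
distributions (3.13) and (3.15) are the basis for a Markov chain Monte Carlo posterior
simulator.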
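
A posterior simulator of this kind can be sketched in a few lines. The version below
is an illustration under a simplifying assumption that is not part of the example:
the precision $h$ is treated as known, so the sampler only alternates between (3.13)
and (3.15). The function name `gibbs_many_means`, the argument names (`mu0` and `h0`
stand in for the underlined prior mean and precision), and the simulated data are
hypothetical.

```python
import numpy as np


def gibbs_many_means(y, mu0, h0, rho, h, n_draws=6000, burn_in=1000, seed=0):
    """Alternate draws from (3.13) and (3.15), holding the precision h fixed.

    y is an n x T array of outcomes; mu0 and h0 are the prior mean and precision.
    Returns retained draws of the means (n_draws x n) and of u (n_draws,).
    """
    rng = np.random.default_rng(seed)
    n, T = y.shape
    row_sums = y.sum(axis=1)
    prec_mu = h0 / (1.0 - rho) + T * h          # precision in (3.14)
    prec_u = h0 / rho + n * h0 / (1.0 - rho)    # precision in (3.16)
    u = mu0                                     # arbitrary starting value
    mu_draws = np.empty((n_draws, n))
    u_draws = np.empty(n_draws)
    for d in range(burn_in + n_draws):
        # Means are conditionally independent given u: (3.13)-(3.14).
        mean_mu = (h0 / (1.0 - rho) * u + h * row_sums) / prec_mu
        mu = mean_mu + rng.normal(size=n) / np.sqrt(prec_mu)
        # u depends on the data only through the means: (3.15)-(3.16).
        mean_u = (h0 / rho * mu0 + h0 / (1.0 - rho) * mu.sum()) / prec_u
        u = mean_u + rng.normal() / np.sqrt(prec_u)
        if d >= burn_in:
            mu_draws[d - burn_in] = mu
            u_draws[d - burn_in] = u
    return mu_draws, u_draws


# Hypothetical data: n = 8 individuals, T = 5 periods, true common mean 1.
data_rng = np.random.default_rng(1)
true_means = 1.0 + 0.7 * data_rng.normal(size=8)
y_sim = true_means[:, None] + data_rng.normal(size=(8, 5))

mu_draws, u_draws = gibbs_many_means(y_sim, mu0=0.0, h0=1.0, rho=0.5, h=1.0)
print("posterior mean of u:", u_draws.mean())
print("posterior means of mu_i:", mu_draws.mean(axis=0).round(2))
```

With $h$ fixed, the posterior of $u$ is also available in closed form (each sample
mean is normal around $u$), which gives a convenient check on the simulator. Adding a
draw of $h$ from its conditional posterior would be needed to match the full model.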

In Example 3.1.1 the investigator specified numerical values for $\rho$ and $\underline h$ in
the prior distribution. Suppose instead that the investigator regards $\rho$ and $\underline h$ as
unobservables, and reflecting this fact replaces them with the symbols $\rho^*$ and $h^*$,
assigning them the independent prior distributions $s^{*2}h^* \mid A_2 \sim \chi^2(\nu^*)$ and
$\rho^* \mid A_2 \sim \text{uniform}(\rho_1^*, \rho_2^*)$, where $s^{*2}$, $\nu^*$, $\rho_1^*$, and $\rho_2^*$ represent positive real numbers
with $\rho_1^* < \rho_2^* < 1$. Note that the conditional distributions of the means $\mu_i$ and the
hyperparameter $u$ remain as they are in (3.13)-(3.16); in particular, $s^{*2}$, $\nu^*$, $\rho_1^*$,
and $\rho_2^*$ do not appear in these expressions. Moreover, the conditional posterior
distributions for $\rho^*$ and $h^*$ do not depend on $y^o$ or $h$.


This example indicates how a hierarchy of prior distributions may be extended. 
The fact that the conditional posterior distribution of u does not depend on the 
data in (3.15) is the manifestation of a universal characteristic of the vector of 
hyperparameters in a model with a two-tier hierarchical prior. From (3.5)—(3.7), 
we obtain 


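$$\begin{aligned}
p(\phi_A \mid \lambda_A, \psi_A, y^o, A) &\propto p(\psi_A, \phi_A \mid A)\,p(\lambda_A \mid \phi_A, A)\,p(y^o \mid \lambda_A, \psi_A, A) \\
&\propto p(\psi_A, \phi_A \mid A)\,p(\lambda_A \mid \phi_A, A).
\end{aligned}$$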


See Exercise 3.1.2 for an extension of this idea. 
Exercise 3.1.1 Completing the Argument Derive (3.13)-(3.16). 


Exercise 3.1.2 Multitier Prior Distributions Consider a model with an $(n-1)$-tier
hierarchical prior distribution of the unobservables. The conditional pdf of the
observables is $p(y \mid \theta_{A1}, A)$, and the prior density is

$$p(\theta_A \mid A) = p(\theta_{An} \mid A)\prod_{i=1}^{n-1} p(\theta_{Ai} \mid \theta_{A,i+1}, A),$$

where $\theta_A' = (\theta_{A1}', \ldots, \theta_{An}')$. Show that in the conditional posterior densities
$p[\theta_{Aj} \mid \theta_{Ai}\ (i \neq j), y^o, A]$, the vectors $\theta_{Ai}$ do not actually appear unless $i = j - 1$ or
$i = j + 1$, and the data vector $y^o$ does not appear unless $j = 1$.


Exercise 3.1.3 Hierarchical Prior Distributions In model $A$, the distribution of
$y' = (y_1, \ldots, y_T)$ conditional on $x' = (x_{1-p}, \ldots, x_{T-1})$ is

$$y_t = \beta_0 + \sum_{s=1}^{p}\beta_s x_{t-s} + \varepsilon_t, \qquad \varepsilon \mid (h, x, A) \sim N(0, h^{-1} I_T),$$

where $\varepsilon' = (\varepsilon_1, \ldots, \varepsilon_T)$. The investigator would like to complete the model with
the independent prior distributions

$$\beta = (\beta_0, \ldots, \beta_p)' \sim N(\underline\beta, \underline H^{-1}), \qquad \underline s^2 h \sim \chi^2(\underline\nu).$$

Her beliefs are well represented by $\underline\beta = 0$. In choosing $\underline H$ she wishes to express the
idea that for $s = 1, \ldots, p$ the coefficients $\beta_s$ are likely to be smaller in absolute
value, the greater is $s$, but she is not sure how quickly they become small.

(a) Set up a hierarchical prior distribution that could represent the investigator's
beliefs about $\beta_1, \ldots, \beta_p$; that is, set $\underline H = \underline H(\phi)$, and then choose a prior
distribution $p(\phi \mid \lambda)$ in which the investigator will fix the value of $\lambda$.

(b) Corresponding to the prior distribution you chose in (a), express:

(i) $p(\beta \mid y, x, h, \underline\beta, \underline s^2, \underline\nu, \phi, \lambda, A)$; besides $\beta$, this expression should
involve only $y$, $x$, $h$, $\underline\beta$, and $\phi$.

(ii) $p(h \mid y, x, \beta, \underline\beta, \underline s^2, \underline\nu, \phi, \lambda, A)$; besides $h$, this expression should
involve only $y$, $x$, $\beta$, $\underline s^2$, and $\underline\nu$.

(iii) $p(\phi \mid y, x, h, \beta, \underline\beta, \underline s^2, \underline\nu, \lambda, A)$; besides $\phi$, this expression should
involve only $\beta$, $\underline\beta$, and $\lambda$.


3.2 IMPROPER PRIOR DISTRIBUTIONS 


Bayesian investigators often report results using prior distributions that are widely 
dispersed, so that their densities are nearly flat, at least over the range of the
parameter space $\Theta_A$ in which the likelihood function is concentrated. As we shall see in
Section 8.4, there can be sound technical reasons for doing this. But this procedure 
also looks appealing on the grounds that a nearly “flat” prior density seems to con- 
vey very little information, and is therefore appropriate in communicating results 
to a diverse group of people who may have very different priors. This rationale is 
misleading, and this can be seen by considering the effects of reparameterization 
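of a model. Suppose that from the model $A$ we create the model $B$ by taking
$\theta_B = f(\theta_A)$, where $f(\cdot)$ is one-to-one. Then we can write $\theta_A = h(\theta_B)$ with $h = f^{-1}$
and $\theta_B \in \Theta_B$, with $\Theta_B$ the image of $\Theta_A$ under $f(\cdot)$. The new observables density is

$$p(y \mid \theta_B, B) = p[y \mid \theta_A = h(\theta_B), A]. \tag{3.17}$$

The new prior density is

$$p(\theta_B \mid B) = p[\theta_A = h(\theta_B) \mid A] \cdot \bigl|\partial h(\theta_B)/\partial\theta_B'\bigr|. \tag{3.18}$$

Note that because of the Jacobian term in (3.18), $p(\theta_B \mid B)$ can be made nearly
"flat" when $p(\theta_A \mid A)$ is not, by appropriate choice of $f$, and vice versa. For the
vector of interest $\omega$, we have

$$p(\omega \mid y, \theta_B, B) = p[\omega \mid y, \theta_A = h(\theta_B), A]. \tag{3.19}$$

For purposes of learning about $\omega$ it does not matter which model is used because

$$\begin{aligned}
p(\omega \mid y, B) &\propto \int_{\Theta_B} p(\omega \mid y, \theta_B, B)\,p(y \mid \theta_B, B)\,p(\theta_B \mid B)\,d\nu(\theta_B) \\
&= \int_{\Theta_B} p[\omega \mid y, \theta_A = h(\theta_B), A]\,p[y \mid \theta_A = h(\theta_B), A]\,
p[\theta_A = h(\theta_B) \mid A]\,\bigl|\partial h(\theta_B)/\partial\theta_B'\bigr|\,d\nu(\theta_B) \\
&= \int_{\Theta_A} p(\omega \mid y, \theta_A, A)\,p(y \mid \theta_A, A)\,p(\theta_A \mid A)\,
\bigl|\partial h(\theta_B)/\partial\theta_B'\bigr| \cdot \bigl|\partial f(\theta_A)/\partial\theta_A'\bigr|\,d\nu(\theta_A) \\
&= \int_{\Theta_A} p(\omega \mid y, \theta_A, A)\,p(y \mid \theta_A, A)\,p(\theta_A \mid A)\,d\nu(\theta_A) \\
&\propto p(\omega \mid y, A).
\end{aligned}$$

[The first equality simply substitutes from (3.19), (3.17) and (3.18). The second
equality is a conventional change of variable from $\theta_B$ to $\theta_A$; the two Jacobian
factors are inverses of one another, so their product is one.] Evidence about $\omega$
will be the same in models $A$ and $B$, and yet the prior in $B$ can be manipulated
to be nearly "flat" in the parameter space $\Theta_B$. Thus the shape of the prior alone is
no indication of how much information it conveys.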
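
A small numerical illustration may help fix the point. The sketch below is not from
the text: it assumes hypothetical Bernoulli data (3 successes in 10 trials), a flat
prior on the success probability $\theta$ in model $A$, and the reparameterization
$\eta = \log[\theta/(1-\theta)]$ in model $B$. The prior implied for $\eta$ by (3.18) is far
from flat, yet the posterior probability of the event $\theta > 1/2$ (equivalently
$\eta > 0$) is the same in both models.

```python
import numpy as np


def trapezoid(values, grid):
    """Composite trapezoid rule, written out to avoid version-specific NumPy names."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))


successes, trials = 3, 10  # hypothetical data


def likelihood(theta):
    return theta ** successes * (1.0 - theta) ** (trials - successes)


def prior_eta(eta):
    """Prior for eta implied by a flat prior on theta: the Jacobian d(theta)/d(eta)."""
    theta = 1.0 / (1.0 + np.exp(-eta))
    return theta * (1.0 - theta)


def posterior_kernel_eta(eta):
    theta = 1.0 / (1.0 + np.exp(-eta))
    return likelihood(theta) * prior_eta(eta)


# Model A: flat prior on theta, posterior probability that theta > 1/2.
theta_all = np.linspace(0.0, 1.0, 20001)
theta_upper = np.linspace(0.5, 1.0, 20001)
prob_theta = (trapezoid(likelihood(theta_upper), theta_upper)
              / trapezoid(likelihood(theta_all), theta_all))

# Model B: implied prior on eta, posterior probability that eta > 0.
eta_all = np.linspace(-30.0, 30.0, 60001)
eta_upper = np.linspace(0.0, 30.0, 30001)
prob_eta = (trapezoid(posterior_kernel_eta(eta_upper), eta_upper)
            / trapezoid(posterior_kernel_eta(eta_all), eta_all))

print(prob_theta, prob_eta)
print(prior_eta(0.0), prior_eta(5.0))  # the implied prior on eta is not flat
```

The grids and truncation at $|\eta| \le 30$ are numerical conveniences; any sufficiently
fine quadrature would serve.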

To develop prior distributions that may nonetheless prove useful in communicating
results, consider a sequence of models $\mathcal{A} = \{A_j\}_{j=1}^{\infty}$, each with the same
parameter vector $\theta_A$, data density $p(y \mid \theta_A, A)$, and vector of interest $\omega$, but with
different prior densities $p(\theta_A \mid A_j)$. Let $k(\theta_A \mid A_j)$ be a sequence of kernels
corresponding to the sequence of prior densities, $k(\theta_A \mid A_j) \propto p(\theta_A \mid A_j)$. Then the
corresponding sequence of posterior densities for $\theta_A$ is

$$p(\theta_A \mid y^o, A_j) \propto p(y^o \mid \theta_A, A)\,k(\theta_A \mid A_j).$$

It may turn out that $k(\theta_A \mid A_j)$ has a pointwise limit $k(\theta_A \mid A)$ that is not finitely
integrable; that is, it is not the kernel of any pdf. At the same time, it may be the
case that $\lim_{j\to\infty} p(\theta_A \mid y^o, A_j) \propto k(\theta_A \mid A)\,p(y^o \mid \theta_A, A)$ is finitely integrable and
therefore is a well-defined posterior density kernel. For many purposes, analysis
may be carried out using $k(\theta_A \mid A)$, ignoring the fact that it cannot be the kernel
of a prior density. The following definition, theorem, and three corollaries develop
these ideas more carefully.

Definition 3.2.1 Let $k(\theta_A \mid A_j)$ be a sequence of prior density kernels for
which $k(\theta_A \mid A) = \lim_{j\to\infty} k(\theta_A \mid A_j)\ \forall\,\theta_A \in \Theta_A$ exists but is not finitely
integrable. If $\lim_{j\to\infty} p(\theta_A \mid y^o, A_j) \propto k(\theta_A \mid A)\,p(y^o \mid \theta_A, A)$ exists and is finitely
integrable, then $k(\theta_A \mid A)$ is an improper prior density kernel for $\theta_A$ in the model
$A$ with data $y^o$.

An attraction of using an improper prior distribution is that it can reflect some
limiting properties of the sequence of distributions $\omega \mid (y^o, A_j)$ and moments
$E[h(\omega) \mid (y^o, A_j)]$. It is important to establish the conditions under which this will
happen, and to see exactly which limiting properties are reflected in the posterior
distribution that employs the improper prior distribution.


Theorem 3.2.1 Convergence of Posterior Densities Given a Sequence of Prior
Densities Let $p(\theta_A \mid A_j)$ be a sequence of prior densities with corresponding
kernels $k(\theta_A \mid A_j)$. Suppose that for all $\theta_A \in \Theta_A$, $k(\theta_A \mid A_j)$ is monotone
nondecreasing with

$$\lim_{j\to\infty} k(\theta_A \mid A_j) = k(\theta_A \mid A),$$

where $k(\theta_A \mid A)$ is an improper prior density kernel for $\theta_A$. Suppose further that

$$c_j = \int_{\Theta_A} p(y^o \mid \theta_A, A)\,k(\theta_A \mid A_j)\,d\nu(\theta_A) \tag{3.20}$$

has finite limit $c$. Then $\theta_A \mid (y^o, A_j) \xrightarrow{d} \theta_A \mid (y^o, A)$ and a kernel of the limiting
posterior distribution $\lim_{j\to\infty} p(\theta_A \mid y^o, A_j)$ is $p(y^o \mid \theta_A, A)\,k(\theta_A \mid A)$.

Proof: Clearly

$$\lim_{j\to\infty} p(y^o \mid \theta_A, A)\,k(\theta_A \mid A_j) = p(y^o \mid \theta_A, A)\,k(\theta_A \mid A)\quad \forall\,\theta_A \in \Theta_A.$$

By the monotone convergence theorem (Royden 1968, Section 4.2)

$$c = \int_{\Theta_A} p(y^o \mid \theta_A, A)\,k(\theta_A \mid A)\,d\nu(\theta_A).$$

Consequently

$$\lim_{j\to\infty} p(y^o \mid \theta_A, A)\,k(\theta_A \mid A_j)/c_j = p(y^o \mid \theta_A, A)\,k(\theta_A \mid A)/c \quad \forall\,\theta_A \in \Theta_A,$$

which is equivalent to $\lim_{j\to\infty} p(\theta_A \mid y^o, A_j) = p(\theta_A \mid y^o, A)\ \forall\,\theta_A \in \Theta_A$. By
Scheffé's theorem (Rao 1965, Theorem 2c.4.xv), $\theta_A \mid (y^o, A_j) \xrightarrow{d} \theta_A \mid (y^o, A)$. ∎

When the conditions of Theorem 3.2.1 are satisfied, reports of posterior densities
of parameters using the improper prior can be interpreted as limits of sequences of
posterior densities employing priors whose kernels converge to the improper prior
kernel. These conditions imply that there exist convergent sequences of credible sets,
as well; that is, $P(\theta_A \in C \mid y^o, A) = 1 - \alpha \Rightarrow \lim_{j\to\infty} P(\theta_A \in C \mid y^o, A_j) = 1 - \alpha$.
Under further weak conditions, the improper prior also provides limits of
moments of $\theta_A$ and functions of $\theta_A$.


Corollary 3.2.1 Convergence of Posterior Moments Given a Sequence of Prior
Densities Suppose $\theta_A \mid (y^o, A_j) \xrightarrow{d} \theta_A \mid (y^o, A)$ and $g : \Theta_A \to \mathbb{R}$ is continuous. If
$\lim_{j\to\infty} E[g(\theta_A) \mid y^o, A_j] = g^*$, then $g^* = E[g(\theta_A) \mid y^o, A]$.

Proof: The conditions imply $g(\theta_A) \mid (y^o, A_j) \xrightarrow{d} g(\theta_A) \mid (y^o, A)$ (Rao 1965,
Theorem 2c.4.xii). The result follows from Rao (1965), Theorem 2c.4.viii. ∎

More generally, it is usually the case that $\omega \mid (y^o, A_j) \xrightarrow{d} \omega \mid (y^o, A)$, but some
conditions on $p(\omega \mid y^o, \theta_A, A_j)$ are needed.

Corollary 3.2.2 Convergence of the Posterior Distribution of a Vector of Interest
Given a Sequence of Prior Densities Suppose $\theta_A \mid (y^o, A_j) \xrightarrow{d} \theta_A \mid (y^o, A)$, and
for all $\omega \in \Omega$, $g(\theta_A; \omega) = p(\omega \mid y^o, \theta_A, A_j)$ is a continuous function of $\theta_A$. Then

$$\omega \mid (y^o, A_j) \xrightarrow{d} \omega \mid (y^o, A).$$

Proof: The conditions imply

$$\liminf_{j\to\infty} \int_{\Theta_A} g(\theta_A; \omega)\,p(\theta_A \mid y^o, A_j)\,d\nu(\theta_A)
\geq \int_{\Theta_A} g(\theta_A; \omega)\,p(\theta_A \mid y^o, A)\,d\nu(\theta_A)$$

(Rao 1965, Theorem 2c.4.vii), which is equivalent to

$$\liminf_{j\to\infty} p(\omega \mid y^o, A_j) \geq p(\omega \mid y^o, A).$$

But

$$\int_{\Omega} p(\omega \mid y^o, A_j)\,d\nu(\omega) = \int_{\Omega} p(\omega \mid y^o, A)\,d\nu(\omega) = 1 \quad \forall\,j,$$

and hence $\lim_{j\to\infty} p(\omega \mid y^o, A_j) = p(\omega \mid y^o, A)$ except possibly on a set of
$\nu$-measure zero. Thus

$$\lim_{j\to\infty} \int_{\Omega} \bigl|p(\omega \mid y^o, A_j) - p(\omega \mid y^o, A)\bigr|\,d\nu(\omega) = 0,$$

and the result follows from Scheffé's theorem. ∎

Finally, posterior moments $E[h(\omega) \mid y^o, A_j]$ converge if in addition $h(\cdot)$ is
continuous.

Corollary 3.2.3 Convergence of Posterior Moments of a Vector of Interest
Given a Sequence of Prior Densities Suppose $\omega \mid (y^o, A_j) \xrightarrow{d} \omega \mid (y^o, A)$ and
$h : \Omega \to \mathbb{R}$ is continuous. If $E[h(\omega) \mid y^o, A_j] \to h^*$, then $h^* = E[h(\omega) \mid y^o, A]$.

Proof: Identical to that of Corollary 3.2.2. ∎

Example 3.2.1 A Sequence of Diffuse Priors for $\beta$ in the Normal Linear
Regression Model In the normal linear regression model (2.10) fix the precision
at $h = h_0$ and consider the sequence of prior distributions

$$\beta \mid A_j \sim N[\underline\beta, (a_j\underline H)^{-1}] \quad (j = 1, 2, \ldots).$$

The sequence $\{a_j\}$ is monotone decreasing with $\lim_{j\to\infty} a_j = 0$. A corresponding
sequence of prior density kernels for $\beta$ is

$$k(\beta \mid A_j) = \exp[-a_j(\beta - \underline\beta)'\underline H(\beta - \underline\beta)/2] \propto p(\beta \mid A_j).$$

The function $k(\beta \mid A_j)$ is monotone increasing to 1 except at $\beta = \underline\beta$ where
$k(\beta \mid A_j) = 1$. Hence it satisfies the condition on kernels in Theorem 3.2.1 with
$k(\beta \mid A) = 1\ \forall\,\beta \in \mathbb{R}^k$. Proceeding as in (2.18a)-(2.21), the corresponding sequence of
posterior density kernels is

$$p(y^o \mid \beta, X, A)\,k(\beta \mid A_j) = (2\pi)^{-T/2} h_0^{T/2}
\exp\{-[(\beta - \bar\beta_j)'\bar H_j(\beta - \bar\beta_j) + Q_j]/2\}, \tag{3.21}$$

where

$$\bar H_j = a_j\underline H + h_0 X'X, \qquad \bar\beta_j = \bar H_j^{-1}(a_j\underline H\,\underline\beta + h_0 X'Xb),$$

and

$$Q_j = h_0\,y^{o\prime}y^o + a_j\,\underline\beta'\underline H\,\underline\beta - \bar\beta_j'\bar H_j\bar\beta_j.$$

Thus

$$\int p(y^o \mid \beta, X, A)\,k(\beta \mid A_j)\,d\beta = (2\pi)^{-(T-k)/2} h_0^{T/2}\exp(-Q_j/2)\,|\bar H_j|^{-1/2} < \infty,$$

which converges to

$$(2\pi)^{-(T-k)/2} h_0^{(T-k)/2}\exp[-h_0(y^o - Xb)'(y^o - Xb)/2]\,|X'X|^{-1/2}.$$

Hence from (3.21) and Theorem 3.2.1 $\beta \mid (y^o, X, A_j) \xrightarrow{d} \beta \mid (y^o, X, A)$, with the
kernel density of the latter distribution given by

$$p(y^o \mid \beta, X, A)\,k(\beta \mid A) \propto \exp[-h_0(\beta - b)'X'X(\beta - b)/2]$$

which shows that $\beta \mid (y^o, X, A) \sim N[b, (h_0 X'X)^{-1}]$. From Corollary 3.2.1, we
have

$$E[\beta \mid (y^o, X, A_j)] \to b, \qquad \mathrm{var}[\beta \mid (y^o, X, A_j)] \to (h_0 X'X)^{-1}.$$

Suppose that the function of interest is the $r \times 1$ vector $\omega = y_* = X_*\beta + \varepsilon_*$,
where $\varepsilon_* \sim N(0, h_0^{-1} I_r)$ is independent of $\varepsilon$. (This is the conventional "prediction
problem" discussed in many basic econometrics texts.) Then

$$p(\omega \mid \beta, X_*, A) \propto \exp[-h_0(\omega - X_*\beta)'(\omega - X_*\beta)/2],$$

which is continuous in $\beta$. From Corollary 3.2.2, we obtain

$$\omega \mid (y^o, X, A_j) \xrightarrow{d} \omega \mid (y^o, X, A) \sim N\{X_*b,\; h_0^{-1}[X_*(X'X)^{-1}X_*' + I_r]\}.$$

Since the elements of $\omega$ and $\omega\omega'$ are continuous functions of $\omega$, it is also the case
(Corollary 3.2.3) that

$$E[\omega \mid (y^o, X, A_j)] \to E[\omega \mid (y^o, X, A)] = X_*b$$

and

$$\mathrm{var}[\omega \mid (y^o, X, A_j)] \to \mathrm{var}[\omega \mid (y^o, X, A)] = h_0^{-1}[X_*(X'X)^{-1}X_*' + I_r].$$
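
The convergence of the posterior moments of $\beta$ can be seen directly by computing
them for a decreasing sequence of prior scale factors. The sketch below is
illustrative only; the simulated data, the prior settings `beta0` and `H0`
(standing in for the underlined prior mean and precision), and the particular values
of $a_j$ are assumptions.

```python
import numpy as np


def posterior_moments(y, X, beta0, H0, h0, a):
    """Posterior mean and variance of beta under the prior N(beta0, (a * H0)^{-1})."""
    H_bar = a * H0 + h0 * X.T @ X
    beta_bar = np.linalg.solve(H_bar, a * H0 @ beta0 + h0 * X.T @ y)
    return beta_bar, np.linalg.inv(H_bar)


rng = np.random.default_rng(0)
T, k, h0 = 30, 2, 2.0
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, -0.5]) + rng.normal(size=T) / np.sqrt(h0)
beta0, H0 = np.array([3.0, 3.0]), np.eye(k)

b = np.linalg.solve(X.T @ X, X.T @ y)      # least squares coefficient vector
limit_var = np.linalg.inv(h0 * X.T @ X)    # limiting posterior variance

mean_gaps, var_gaps = [], []
for a in [1.0, 1e-2, 1e-4, 1e-8]:
    mean, var = posterior_moments(y, X, beta0, H0, h0, a)
    mean_gaps.append(np.abs(mean - b).max())
    var_gaps.append(np.abs(var - limit_var).max())
    print(f"a = {a:g}: max |mean - b| = {mean_gaps[-1]:.2e}, "
          f"max |var - limit| = {var_gaps[-1]:.2e}")
```

The prior mean is deliberately placed far from the least squares estimate so that
the shrinkage toward it, and its disappearance as $a_j \to 0$, is visible.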


An important limitation of improper priors is that they lead to models whose 
marginal likelihood is zero. 


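Theorem 3.2.2 Marginal Likelihood Given an Improper Prior The conditions
of Theorem 3.2.1 imply

$$\lim_{j\to\infty} \int_{\Theta_A} p(\theta_A \mid A_j)\,p(y^o \mid \theta_A, A)\,d\nu(\theta_A) = 0.$$

Proof: Let $d_j = \int_{\Theta_A} k(\theta_A \mid A_j)\,d\nu(\theta_A)$. Then

$$\int_{\Theta_A} p(\theta_A \mid A_j)\,p(y^o \mid \theta_A, A)\,d\nu(\theta_A) = c_j/d_j$$

where $c_j$ is as defined in (3.20). Since $d_j \to \infty$ and $c_j \to c < \infty$, the result
follows. ∎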


As consequences, a model A with an improper prior distribution has no weight 
in model averaging (2.73)—(2.74) and will never be selected in a model choice 
decision problem. The latter result is widely known as “Lindley’s paradox,” after 
Lindley (1957) and Bartlett (1957). 


Example 3.2.2 Marginal Likelihood for a Sequence of Diffuse Priors for $\beta$
in the Normal Linear Regression Model Continuing with Example 3.2.1, from
(2.80) the marginal likelihood of the $j$th model in the sequence is

$$(2\pi)^{-T/2} h_0^{T/2}\,|a_j\underline H|^{1/2}\,|a_j\underline H + h_0 X'X|^{-1/2}\exp(-Q_j/2). \tag{3.22}$$

Since $\lim_{j\to\infty} Q_j = h_0(y^o - Xb)'(y^o - Xb)$ is finite while $|a_j\underline H|^{1/2} = a_j^{k/2}|\underline H|^{1/2} \to 0$,
the limit of (3.22) is zero.
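
The rate at which the marginal likelihood vanishes can be examined numerically. The
following sketch evaluates the log of (3.22) for a decreasing sequence of $a_j$. It is
an illustration under assumed inputs: the simulated data, the prior settings `beta0`
and `H0`, and the function name are hypothetical.

```python
import numpy as np


def log_marginal_likelihood(y, X, beta0, H0, h0, a):
    """Log of (3.22) and the quantity Q_j, for prior precision a * H0."""
    T = y.size
    H_bar = a * H0 + h0 * X.T @ X
    beta_bar = np.linalg.solve(H_bar, a * H0 @ beta0 + h0 * X.T @ y)
    Q = h0 * y @ y + a * beta0 @ H0 @ beta0 - beta_bar @ H_bar @ beta_bar
    logdet_prior = np.linalg.slogdet(a * H0)[1]
    logdet_post = np.linalg.slogdet(H_bar)[1]
    log_ml = 0.5 * (-T * np.log(2.0 * np.pi) + T * np.log(h0)
                    + logdet_prior - logdet_post - Q)
    return log_ml, Q


rng = np.random.default_rng(0)
T, k, h0 = 30, 2, 2.0
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, -0.5]) + rng.normal(size=T) / np.sqrt(h0)
beta0, H0 = np.zeros(k), np.eye(k)

b = np.linalg.solve(X.T @ X, X.T @ y)
limit_Q = h0 * (y - X @ b) @ (y - X @ b)   # limit of Q_j as a_j -> 0

log_mls, Qs = [], []
for a in [1e-1, 1e-3, 1e-5, 1e-7]:
    log_ml, Q = log_marginal_likelihood(y, X, beta0, H0, h0, a)
    log_mls.append(log_ml)
    Qs.append(Q)
    print(f"a = {a:g}: log marginal likelihood = {log_ml:.3f}, Q = {Q:.6f}")
print("limit of Q:", limit_Q)
```

For small $a_j$ each hundredfold reduction lowers the log marginal likelihood by
roughly $(k/2)\log 100$, so the marginal likelihood itself goes to zero even though
$Q_j$ settles at a finite value.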
Theorem 3.2.1, Corollaries 3.2.1-3.2.3, and Example 3.2.1 show that there are
reasonable conditions under which the use of an improper prior can be interpreted
as a limiting case of the use of a sequence of prior distributions. This
sequence in turn may be interpreted as "increasingly less concentrated" in the
sense that $\lim_{j\to\infty} p(\theta_A \mid A_j) = 0\ \forall\,\theta_A \in \Theta_A$. But the argument made at the start
of this section shows that interpreting this sequence as an approach to "uninformative"
priors is treacherous because it need not be invariant under transformation.
In view of this difficulty, Jeffreys (1961) proposed a particular prior density that
is unique under transformation.


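Definition 3.2.2 If $\{y_t\}$ is i.i.d. with pdf $p(y_t \mid \theta_A, A)$, and $p(y_t \mid \theta_A, A)$ is
differentiable with respect to $\theta_A\ \forall\,\theta_A \in \Theta_A$, then the Jeffreys invariant prior density
kernel is

$$k(\theta_A \mid A) = \left|E\left[\frac{\partial\log p(y_t \mid \theta_A, A)}{\partial\theta_A}\,\frac{\partial\log p(y_t \mid \theta_A, A)}{\partial\theta_A'}\,\Big|\,\theta_A, A\right]\right|^{1/2} \tag{3.23}$$

$$= \left|-E\left[\frac{\partial^2\log p(y_t \mid \theta_A, A)}{\partial\theta_A\,\partial\theta_A'}\,\Big|\,\theta_A, A\right]\right|^{1/2}. \tag{3.24}$$

Note that expectation is with respect to the random vector $y_t$ and not the constant
vector $\theta_A$ in (3.23). The equality in (3.24) is a property of probability
densities widely used in non-Bayesian statistics; for example, see Poirier (1995),
Theorem 6.5.1.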


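Theorem 3.2.3 Invariance of the Jeffreys Prior The Jeffreys invariant prior
density is invariant under one-to-one reparameterization.

Proof: Construct the model $B$ by taking $\theta_B = f(\theta_A)$, $\theta_A = h(\theta_B)$ and
$p(y_t \mid \theta_B, B) = p[y_t \mid \theta_A = h(\theta_B), A]$. Applying (3.23), we obtain

$$\begin{aligned}
k(\theta_B \mid B) &= k[h(\theta_B) \mid A] \cdot \bigl|\partial h(\theta_B)/\partial\theta_B'\bigr| \\
&= \left|E\left[\frac{\partial\log p[y_t \mid h(\theta_B), A]}{\partial h(\theta_B)}\,\frac{\partial\log p[y_t \mid h(\theta_B), A]}{\partial h(\theta_B)'}\,\Big|\,\theta_A, A\right]\right|^{1/2}
\cdot \left|\frac{\partial h(\theta_B)}{\partial\theta_B'}\right| \\
&= \left|E\left[\frac{\partial h(\theta_B)'}{\partial\theta_B}\,\frac{\partial\log p[y_t \mid h(\theta_B), A]}{\partial h(\theta_B)}\,\frac{\partial\log p[y_t \mid h(\theta_B), A]}{\partial h(\theta_B)'}\,\frac{\partial h(\theta_B)}{\partial\theta_B'}\,\Big|\,\theta_A, A\right]\right|^{1/2} \\
&= \left|E\left[\frac{\partial\log p(y_t \mid \theta_B, B)}{\partial\theta_B}\,\frac{\partial\log p(y_t \mid \theta_B, B)}{\partial\theta_B'}\,\Big|\,\theta_B, B\right]\right|^{1/2}. \qquad ∎
\end{aligned}$$
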


Example 3.2.3 Jeffreys Invariant Prior for the Bernoulli Distribution For
a sequence of i.i.d. Bernoulli trials with outcomes $y_t = 0$ or $y_t = 1$,
$p(y_t \mid \theta) = \theta^{y_t}(1-\theta)^{1-y_t}$. Then $\log p(y_t \mid \theta) = y_t\log(\theta) + (1 - y_t)\log(1 - \theta)$, and
$d\log p(y_t \mid \theta)/d\theta = \theta^{-1}y_t + (1-\theta)^{-1}(y_t - 1)$. Since

$$E[d\log p(y_t \mid \theta)/d\theta]^2 = E[\theta^{-1}y_t + (1-\theta)^{-1}(y_t - 1)]^2
= \theta\cdot\theta^{-2} + (1-\theta)(1-\theta)^{-2} = \theta^{-1}(1-\theta)^{-1},$$

the Jeffreys invariant prior density kernel is $k(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$.
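
Both the information calculation and the invariance property can be verified with a
few lines of code. The sketch below is illustrative; the function names and the
choice of the log-odds reparameterization $\eta = \log[\theta/(1-\theta)]$ are assumptions
made for this example. It computes the expected squared score by summing over the
two outcomes, then checks that the Jeffreys kernel derived directly in $\eta$ equals
the kernel in $\theta$ multiplied by the Jacobian $d\theta/d\eta = \theta(1-\theta)$.

```python
import numpy as np


def fisher_information_theta(theta):
    """Expected squared score of one Bernoulli(theta) trial, summed over y in {0, 1}."""
    info = 0.0
    for y, prob in ((1, theta), (0, 1.0 - theta)):
        score = y / theta + (y - 1) / (1.0 - theta)
        info += prob * score ** 2
    return info


def jeffreys_kernel_theta(theta):
    return np.sqrt(fisher_information_theta(theta))


def jeffreys_kernel_eta(eta):
    """Jeffreys kernel derived directly in the log-odds parameterization."""
    theta = 1.0 / (1.0 + np.exp(-eta))
    # log p(y | eta) = y * eta - log(1 + exp(eta)), so the score is y - theta.
    info = theta * (1.0 - theta) ** 2 + (1.0 - theta) * (0.0 - theta) ** 2
    return np.sqrt(info)


for theta in (0.1, 0.3, 0.5, 0.9):
    eta = np.log(theta / (1.0 - theta))
    transformed = jeffreys_kernel_theta(theta) * theta * (1.0 - theta)
    print(theta, fisher_information_theta(theta), 1.0 / (theta * (1.0 - theta)),
          jeffreys_kernel_eta(eta), transformed)
```

In each row the second and third numbers agree (the information equals
$\theta^{-1}(1-\theta)^{-1}$), as do the fourth and fifth (invariance under the change of
variable).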


Exercise 3.2.1 Improper Prior Distributions in the Normal Linear Regression
Model In Example 3.2.1 the precision $h$ was fixed. Suppose instead that there is
a sequence of prior distributions for $h$, $\underline s_j^2 h \mid A_j \sim \chi^2(\underline\nu_j)$.

(a) Let $\underline s_j^2 = q\underline\nu_j$ where $q > 0$, and suppose $\lim_{j\to\infty}\underline\nu_j = 0$. Find a corresponding
sequence of kernels $k(h \mid A_j)$ satisfying the conditions of Theorem 3.2.1
and for which $\lim_{j\to\infty} k(h \mid A_j) = k(h \mid A) = h^{-1} I_{(0,\infty)}(h)$.

(b) Suppose that in the normal linear regression model of Example 2.1.2

$$q\underline\nu_j h \mid A_j \sim \chi^2(\underline\nu_j) \quad\text{and}\quad \beta \mid A_j \sim N[\underline\beta, (a_j\underline H)^{-1}]$$

are the independent prior distributions for $h$ and $\beta$. Also suppose that
$\mathrm{rank}(X) = k$. For $\lim_{j\to\infty}\underline\nu_j = \lim_{j\to\infty} a_j = 0$, write the limiting posterior
density kernel $k(\beta, h \mid y^o, X, A)$. Show that this is a density kernel of the
distribution

$$s^2 h \mid (y^o, X, A) \sim \chi^2(T - k),$$
$$\beta \mid (h, y^o, X, A) \sim N[b, (hX'X)^{-1}],$$

where $b = (X'X)^{-1}X'y^o$ and $s^2 = (y^o - Xb)'(y^o - Xb)$. Thus $\beta$ and $h$ have a
normal-gamma posterior distribution. It follows (recall Example 2.3.3) that

$$\beta \mid (y^o, X, A) \sim t[b, (T - k)^{-1}s^2(X'X)^{-1}; T - k].$$

(c) Suppose that in the normal linear regression model with conjugate prior
distribution of Example 2.3.3

$$q\underline\nu_j h \mid A_j \sim \chi^2(\underline\nu_j) \quad\text{and}\quad \beta \mid (h, A_j) \sim N[\underline\beta, (a_j h\underline H)^{-1}].$$

Also suppose that $\mathrm{rank}(X) = k$. For $\lim_{j\to\infty}\underline\nu_j = \lim_{j\to\infty} a_j = 0$, write the
limiting posterior density kernel $k(\beta, h \mid y^o, X, A)$. Show that this is a density
kernel of the distribution

$$s^2 h \mid (y^o, X, A) \sim \chi^2(T),$$
$$\beta \mid (h, y^o, X, A) \sim N[b, (hX'X)^{-1}].$$

It follows that $\beta \mid (y^o, X, A) \sim t[b, T^{-1}s^2(X'X)^{-1}; T]$.

(d) The conventional non-Bayesian treatment of the normal linear model is
derived as follows:

$$hs^2 \mid (\beta, h, X, A) \sim \chi^2(T - k),$$
$$b \mid (\beta, h, X, A) \sim N[\beta, (hX'X)^{-1}],$$
$$b \mid (\beta, X, A) \sim t[\beta, (T - k)^{-1}s^2(X'X)^{-1}; T - k].$$

It is common to give an informal Bayesian interpretation of these results
in statements such as "The probability that $\beta_j$ is negative is... ." Using the
results in parts (b) and (c), provide a formal Bayesian interpretation of such
statements.


Exercise 3.2.2 An Invariant Prior Distribution Suppose that the observables
are independently and uniformly distributed on the interval $(0, \theta)$.

(a) What is the Jeffreys invariant prior distribution for $\theta$? Is this prior conjugate?

(b) Consistent with Theorem 3.2.1, can you find a sequence of proper prior
densities $p(\theta \mid A_j)$ with kernels $k(\theta \mid A_j) \geq k(\theta \mid A_{j-1})$, that has as its
pointwise limit the prior distribution you found in (a)?


Exercise 3.2.3 Jeffreys Prior for the Exponential Distribution Sometimes the
pdf of the exponential distribution is written $p(y \mid \theta, A) = \theta\exp(-\theta y)$, and sometimes
it is written $p(y \mid \lambda, A) = \lambda^{-1}\exp(-y/\lambda)$.

(a) Derive the Jeffreys prior for $\theta$ and the Jeffreys prior for $\lambda$. Show that these
priors are improper.

(b) Derive the corresponding posterior densities for an i.i.d. sample $y^o$ and show
that for any finite interval $S$ of the real line, $P(\theta \in S \mid y^o, A) = P(\lambda^{-1} \in S \mid y^o, A)$.

(c) Suppose that instead of the Jeffreys prior for $\theta$, we used the improper "flat"
prior $p(\theta) \propto I_{(0,\infty)}(\theta)$. Given a sample of size $T = 1$ with single observation
$y^o = 1$, compute the posterior probability that $\theta < 1$.

(d) Suppose now that we used the same improper "flat" prior for $\lambda$. Try to
find the posterior probability that $\lambda > 1$, given the same single observation
$y^o = 1$.


Exercise 3.2.4 Jeffreys Prior for the Poisson Distribution Find the Jeffreys 
prior for the parameter $\theta$ of a Poisson distribution [see Exercise 2.3.1(d)], assuming
i.i.d. sampling. Is the prior conjugate? 


Exercise 3.2.5 Lindley's Paradox Suppose $y_t \overset{iid}{\sim} N(\mu, h^{-1})$ where $h$ is
known. Here are four alternative prior distributions for $\mu$:


• $\mu = 0$
e p(w) & Io, (u) 
• $p(\mu) \propto I_{(0,\infty)}(\mu)$
• $p(\mu) \propto I_{(-\infty,\infty)}(\mu)$


Given a sample of size T, there are six distinct Bayes factors for pairs of these 
hypotheses that could be formed. 


(a) Which Bayes factors will be trivially zero or infinite, and why? 
(b) For the nontrivial pairs, express the Bayes factors using standard notation. 


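Exercise 3.2.6 Predictive Densities and Improper Priors Suppose that in the
normal linear model (Example 2.1.2) h is fixed at $h = h_0$. Partition X and y:

$$\underset{T \times k}{X} = \begin{bmatrix} \underset{T_1 \times k}{X_1} \\ \underset{T_2 \times k}{X_2} \end{bmatrix}, \qquad \underset{T \times 1}{y} = \begin{bmatrix} \underset{T_1 \times 1}{y_1} \\ \underset{T_2 \times 1}{y_2} \end{bmatrix}.$$

In model A, $\beta \sim N(\underline{\beta}, \underline{H}^{-1})$. There is a sequence of models $\{B_j\}$ that differ from
A and from each other only with respect to the prior distribution for $\beta$:
$\beta \mid B_j \sim N(\underline{\beta}, (j+1)\underline{H}^{-1})$. There are data $y_1^o$ for the observable $y_1$, but $y_2$ has not been
observed. The covariate matrix X is known.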


(a) Find the limiting distribution $\lim_{j\to\infty} \beta \mid (X_1, y_1^o, B_j)$.

(b) Show that $\lim_{j\to\infty} p(y_1^o \mid X_1, B_j)/p(y_1^o \mid X_1, A) = 0$.

(c) Find the distribution $y_2 \mid (X_1, X_2, y_1^o, A)$ and the limiting distribution
$\lim_{j\to\infty} y_2 \mid (X_1, X_2, y_1^o, B_j)$.

(d) Now suppose that we obtain the data $y_2^o$. Is it the case that

$$\lim_{j\to\infty} \frac{p(y_2^o \mid X_1, X_2, y_1^o, B_j)}{p(y_2^o \mid X_1, X_2, y_1^o, A)} = 0?$$


3.3 PRIOR ROBUSTNESS AND THE DENSITY RATIO CLASS 


In many instances Bayesian investigators do not know their clients’ priors, or even 
the identity of their clients. For example, the investigator may be an academic 
economist and the clients, the readers of an article published by the economist. 
A number of approaches can be taken in this situation. One is to report posterior 
moments corresponding to alternative priors, but such an enumeration can become 
tiresome long before reasonable possibilities for priors are exhausted. Another is to 
provide clients with the ability to modify priors simply and directly and examine 
the impact on posterior moments. Section 8.4 discusses this approach. Another 
possibility is to report a range along with each posterior moment, corresponding 
to all possible prior distributions within a specified class of distributions. Several 
interesting classes of prior distributions have been proposed, reviewed in Berger 
(1994) and Wasserman (1992).

This section takes up the density ratio class of prior distributions. This class
consists of all prior distributions with a probability density kernel $k(\theta_A \mid A)$ that
satisfies the inequalities $a(\theta_A) \le k(\theta_A \mid A) \le b(\theta_A)$, where $a(\theta_A)$ and $b(\theta_A)$ are
kernels of prior densities that yield proper posterior densities. A case of particular
interest is $b(\theta_A) = r \cdot a(\theta_A)$. The density ratio class then permits ratios of prior
probabilities of subsets of $\Theta_A$ to vary by a factor of up to r from the
corresponding ratios implied by the prior density kernel $k(\theta_A \mid A)$. If we interpret improper
prior density kernels as assigning relative probabilities to $\Theta_A^* \subseteq \Theta_A$ for which
$\int_{\Theta_A^*} k(\theta_A \mid A)\, d\theta_A < \infty$, then the density ratio class can be extended to improper
prior distributions with the same interpretation.

The development here was first presented in Geweke and Petrella (1998). It
builds on work in Wasserman and Kadane (1992) and Lavine (1991a, 1991b), and
is the basis for the routine and efficient computation of bounds of posterior moments
$E(\omega \mid y^o, A)$ approximated by posterior simulators developed in Section 8.5.

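In this section, let $g(\theta_A) = \int_\Omega h(\omega)\, p(\omega \mid y^o, \theta_A, A)\, d\nu(\omega)$. For any prior kernel
$k(\theta_A \mid A)$, proper or improper, we obtain

$$E[g(\theta_A) \mid y^o, A] = \frac{\int_{\Theta_A} k(\theta_A \mid A)\, p(y^o \mid \theta_A, A)\, g(\theta_A)\, d\nu(\theta_A)}{\int_{\Theta_A} k(\theta_A \mid A)\, p(y^o \mid \theta_A, A)\, d\nu(\theta_A)}. \tag{3.25}$$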


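Let $a(\theta_A)$ and $b(\theta_A)$ be given functions for which $0 \le a(\theta_A) \le k(\theta_A \mid A) \le b(\theta_A)$
$\forall\, \theta_A \in \Theta_A$, and $b(\theta_A)\, p(y^o \mid \theta_A, A)$ is finitely integrable on $\Theta_A$. The formal
problem is to determine the range of values of (3.25) over the set S of all prior density
kernels $k(\theta_A \mid A)$ satisfying $0 \le a(\theta_A) \le k(\theta_A \mid A) \le b(\theta_A)$, that is, to determine

$$\underline{E}[g(\theta_A) \mid y^o, A] = \inf_{k \in S} E[g(\theta_A) \mid y^o, A]$$

and

$$\overline{E}[g(\theta_A) \mid y^o, A] = \sup_{k \in S} E[g(\theta_A) \mid y^o, A].$$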


Because $\underline{E}[g(\theta_A) \mid y^o, A] = -\overline{E}[-g(\theta_A) \mid y^o, A]$, only the maximization
problem need be considered. The following result was shown in DeRobertis and
Hartigan (1981), Proposition 4.1; Lavine (1991b), Claim 3; Wasserman and Kadane
(1992), Theorem 4(b); and Wasserman et al. (1993), p. 308. A proof is included
here because it parallels a similar result based on posterior simulators presented in
Section 8.5.


Theorem 3.3.1 Bounding a Posterior Moment over the Density Ratio Class
of Priors Let $b(\theta_A)$ be a prior density kernel for which the posterior density
kernel $b(\theta_A)\, p(y^o \mid \theta_A, A)$ is finitely integrable on $\Theta_A$, and let $a(\theta_A) \le b(\theta_A)$ be a
second prior density kernel. Let S be the set of all prior density kernels k satisfying
$a(\theta_A) \le k(\theta_A) \le b(\theta_A)$ $\forall\, \theta_A \in \Theta_A$, and suppose that (3.25) is bounded above for
$k \in S$. Let

$$k(\theta_A; c) = \begin{cases} a(\theta_A) & \text{if } g(\theta_A) \le c \\ b(\theta_A) & \text{if } g(\theta_A) > c. \end{cases}$$

Then the unique solution of

$$f(c) = \int_{\Theta_A} [g(\theta_A) - c]\, p(y^o \mid \theta_A, A)\, k(\theta_A; c)\, d\nu(\theta_A) = 0 \tag{3.26}$$

is $\bar{c} = \overline{E}[g(\theta_A) \mid y^o, A] = \sup_{k \in S} E[g(\theta_A) \mid y^o, A]$.


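Proof: Since (3.25) is bounded above for $k \in S$, $f(c)$ is finite for all real c.
Moreover $f(c)$ is differentiable and

$$f'(c) \le -\int_{\Theta_A} p(y^o \mid \theta_A, A)\, a(\theta_A)\, d\nu(\theta_A) \tag{3.27}$$

for all c. Hence (3.26) has exactly one solution.
For all $k \in S$, $k(\theta_A) \le k(\theta_A; \bar{c})$ if $g(\theta_A) - \bar{c} > 0$ and $k(\theta_A) \ge k(\theta_A; \bar{c})$ if
$g(\theta_A) - \bar{c} < 0$. Hence

$$\int_{\Theta_A} [g(\theta_A) - \bar{c}]\, p(y^o \mid \theta_A, A)\, k(\theta_A)\, d\nu(\theta_A) \le 0,$$

and

$$\frac{\int_{\Theta_A} k(\theta_A)\, p(y^o \mid \theta_A, A)\, g(\theta_A)\, d\nu(\theta_A)}{\int_{\Theta_A} k(\theta_A)\, p(y^o \mid \theta_A, A)\, d\nu(\theta_A)} \le \frac{\int_{\Theta_A} k(\theta_A; \bar{c})\, p(y^o \mid \theta_A, A)\, g(\theta_A)\, d\nu(\theta_A)}{\int_{\Theta_A} k(\theta_A; \bar{c})\, p(y^o \mid \theta_A, A)\, d\nu(\theta_A)} = \bar{c}.$$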

Because Theorem 3.3.1 remains valid with the formal substitution
$p(y^o \mid \theta_A, A) = 1$, it provides bounds on prior expectations in a density ratio class of
prior distributions as well.
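The construction in Theorem 3.3.1 can be sketched numerically. The block below is a minimal illustration, not the posterior-simulator method of Section 8.5: it assumes a scalar $\theta_A$ on an equally spaced grid, replaces the integral in (3.26) by a sum, and finds the root by bisection. The names `upper_bound`, `a_post`, and `b_post` are hypothetical; `a_post` and `b_post` stand for $a(\theta_A)p(y^o \mid \theta_A, A)$ and $b(\theta_A)p(y^o \mid \theta_A, A)$ evaluated on the grid.

```python
import math

def upper_bound(grid, g, a_post, b_post, tol=1e-10):
    """Approximate the root of f(c) = sum_i [g_i - c] * k_i(c), where
    k_i(c) = a_post[i] if g_i <= c and b_post[i] otherwise.
    Assumes an equally spaced grid, so the step size cancels in f(c) = 0."""
    gv = [g(x) for x in grid]

    def f(c):
        return sum((gi - c) * (ai if gi <= c else bi)
                   for gi, ai, bi in zip(gv, a_post, b_post))

    lo, hi = min(gv), max(gv)      # f(lo) >= 0 >= f(hi), and f is decreasing
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustration: a(theta) * likelihood is the standard normal kernel, b = r * a.
r = 10.0
grid = [-8.0 + 0.01 * i for i in range(1601)]
a_post = [math.exp(-0.5 * x * x) for x in grid]
b_post = [r * ai for ai in a_post]

upper = upper_bound(grid, lambda x: x, a_post, b_post)
lower = -upper_bound(grid, lambda x: -x, a_post, b_post)   # inf via -sup of -g
print(upper, lower)
```

The lower bound reuses the same routine through the identity that the infimum of the moment of g is minus the supremum of the moment of -g.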


Example 3.3.1 Density Ratio Bounds for the Normal Density Suppose that
there is a single unobservable $\theta_A$, $p(y^o \mid \theta_A, A)\, a(\theta_A)$ is the kernel of the
standard normal distribution, and $b(\theta_A)/a(\theta_A) = r > 1$. For $g(\theta_A) = \theta_A$, (3.26) then
becomes

$$\int_{-\infty}^{c} (\theta_A - c)\, \phi(\theta_A)\, d\theta_A + r \int_{c}^{\infty} (\theta_A - c)\, \phi(\theta_A)\, d\theta_A = 0, \tag{3.28}$$

where $\phi(\cdot)$ is the pdf of the standard normal distribution. Denoting the cdf of the
standard normal distribution by $\Phi(\cdot)$ and using the relation $\int_{-\infty}^{c} z\, \phi(z)\, dz = -\phi(c)$
[Johnson et al. (1994), (13.134)], it follows from (3.28) that

$$(r - 1)[\phi(c) + c\, \Phi(c)] - rc = 0.$$

The unique solution of this equation, $c = \gamma(r)$, is displayed in the solid line in
Figure 3.1.


The function $\gamma(r)$ provides some guidance in choosing r in this and similar
density ratio classes. For example, r = 10 permits the prior mean of any one 
parameter to shift up or down by about 0.9 prior standard deviation. To allow a 
shift of 1.5 prior standard deviations in a prior mean requires r = 52.3. Larger 
shifts in the prior mean require very large values of r because the tails of the 
normal distribution decline rapidly. 
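The values quoted above can be checked by solving $(r - 1)[\phi(c) + c\,\Phi(c)] - rc = 0$ directly. The sketch below uses only the Python standard library and plain bisection; the function name `gamma_normal` and the bracketing interval $[0, 10]$ are choices made here for illustration.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gamma_normal(r, tol=1e-12):
    """Root c of (r - 1)[phi(c) + c * Phi(c)] - r * c = 0 for r > 1."""
    if r <= 1.0:
        raise ValueError("r must exceed 1")

    def f(c):
        return (r - 1.0) * (phi(c) + c * Phi(c)) - r * c

    # f(0) = (r - 1) * phi(0) > 0, f is strictly decreasing because
    # f'(c) = (r - 1) * Phi(c) - r < 0, and f(10) < 0 for any r of practical size.
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for r in (2.0, 10.0, 52.3):
    print(r, round(gamma_normal(r), 3))
```

Bisection is used rather than a faster root finder because the function is monotone and the bracket is known, so convergence is guaranteed without tuning.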


Example 3.3.2 Density Ratio Bounds for Student-t Densities In the same
situation as Example 3.3.1 suppose instead that $p(y^o \mid \theta_A, A)\, a(\theta_A)$ is the kernel of
the standard Student-t distribution with $\lambda > 1$ degrees of freedom. Then c is the
root of

$$\int_{-\infty}^{c} (\theta_A - c)\, t(\theta_A; \lambda)\, d\theta_A + r \int_{c}^{\infty} (\theta_A - c)\, t(\theta_A; \lambda)\, d\theta_A = 0,$$

where $t(\cdot\,; \lambda)$ is the pdf of the standard Student-t distribution with $\lambda$ degrees of
freedom. The unique solution of this equation, $c = \gamma^*(r; \lambda)$, is also shown in
Figure 3.1 for several values of $\lambda$. Note that, for small $\lambda$, a given value of r
permits a much larger shift for this set of prior distributions than does the normal
case. This is because the tails of the Student-t density with low degrees of freedom
are much thicker than those of the normal density.

[Figure 3.1. Gamma functions for normal, t(15), t(6), and t(3) distributions.
Horizontal axis: density class ratio r, from 0 to 50. Vertical axis: from 0 to 3.
Curves: $\gamma(r)$, $\gamma^*(r; 15)$, $\gamma^*(r; 6)$, $\gamma^*(r; 3)$.]


Exercise 3.3.1 Completing the Argument Derive (3.27). 


Exercise 3.3.2 Extending Examples 3.3.1 and 3.3.2 In this exercise, the random
variable x represents any unobservable, for example, $x = g(\theta_A)$. Its distribution
could be the posterior, the prior, or some other distribution.


(a) Let $a(x)$ be any density kernel of the $N(\mu, h^{-1})$ distribution, and $b(x) = r \cdot a(x)$,
where $r > 1$. Show that if the random variable x has probability
density kernel $k(x)$ and $a(x) \le k(x) \le b(x)$, then $E(x) \le \mu + h^{-1/2} \gamma(r)$.

(b) In the same situation as (a), suppose instead that $a(x)$ is any density kernel
of the Student-t distribution with $\lambda > 1$ degrees of freedom. Show that
$E(x) \le \mu + h^{-1/2} \gamma^*(r; \lambda)$.


3.4 ASYMPTOTIC ANALYSIS 


Asymptotic analysis addresses properties of the limiting behavior of a posterior
density $p(\theta_A \mid Y_T^o, A)$ as sample size $T \to \infty$. These properties depend on the
behavior of the sequence $Y_T^o = \{y_1^o, \ldots, y_T^o\}$. We shall assume in this section only
that $Y_T^o$ is the observed value of a random vector $\tilde{Y}_T$ with probability density
$p(Y_T \mid D)$, where D is the data-generating process. The vectors $\tilde{\theta}_A$ and $\tilde{Y}_T$ are random,
with density $p(Y_T \mid D)\, p(\theta_A \mid Y_T, A)$. We do not assume that there necessarily
exists any $\theta_A \in \Theta_A$ such that $p(Y_T \mid \theta_A, A) = p(Y_T \mid D)$. In asymptotic analysis,
$\theta_A$ and $Y_T$ appear repeatedly in circumstances where one or the other can be either
a random vector or the argument of a function. For clarity we shall adopt the
convention, in this section only, of using a tilde to distinguish a random vector
from the argument of a function. Thus, for example,
$P(\tilde{Y}_T \in C \mid D) = \int_C p(Y_T \mid D)\, d\nu(Y_T)$. Finally, we shall assume in this section only that the dimension of the
$k_A \times 1$ vector $\theta_A$ is fixed and does not depend on sample size.

The case in which $\theta_A$ is discrete provides one of the most important conclusions
of asymptotic analysis without the technical conditions required in the case of
continuous $\theta_A$.


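Theorem 3.4.1 Asymptotic Concentration of the Posterior Distribution for a
Discrete Parameter Vector Suppose that

1. The vector of unobservables $\theta_A$ has a discrete prior distribution over a finite
set of n points $\theta_{Ai}$, and $p(\theta_{Ai} \mid A) > 0$ $(i = 1, \ldots, n)$.

2. $T^{-1} \log p(\tilde{Y}_T \mid \theta_{Ai}, A) \xrightarrow{a.s.} \ell(\theta_{Ai}; A)$ $(i = 1, \ldots, n)$.

3. For one $j \in \{1, \ldots, n\}$, $\ell(\theta_{Aj}; A) > \ell(\theta_{Ai}; A)$ for all $i = 1, \ldots, n$, $i \ne j$.


Then

$$\lim_{T \to \infty} P(\theta_A = \theta_{Aj} \mid \tilde{Y}_T, A) = 1 \quad a.s. \tag{3.29}$$


Proof: For all $i \ne j$,

$$T^{-1} \log[p(\theta_{Ai} \mid \tilde{Y}_T, A)/p(\theta_{Aj} \mid \tilde{Y}_T, A)] = T^{-1} \log[p(\theta_{Ai} \mid A)/p(\theta_{Aj} \mid A)] + T^{-1} \log[p(\tilde{Y}_T \mid \theta_{Ai}, A)/p(\tilde{Y}_T \mid \theta_{Aj}, A)] \xrightarrow{a.s.} \ell(\theta_{Ai}; A) - \ell(\theta_{Aj}; A).$$

By condition 3 this limit is negative, so the posterior odds ratio in favor of each
$\theta_{Ai}$ $(i \ne j)$ over $\theta_{Aj}$ tends to zero. ∎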


This result does not require that $y_t$ be i.i.d., either in $p(Y_T \mid D)$ or in any of
$p(Y_T \mid \theta_{Ai}, A)$ $(i = 1, \ldots, n)$. However, if $y_t$ is i.i.d. in both D and A, then condition 2
of Theorem 3.4.1 may be restated in terms of Kullback-Leibler information. [On
the wider significance of Kullback-Leibler information, see Mittelhammer et al.
(2000), Section 13.1.1.]


Definition 3.4.1 Given two densities $p(y \mid A)$ and $p(y \mid B)$ for the same
observable y and defined with respect to the same measure $\nu$, the Kullback-Leibler
information criterion (KLIC) distance from $p(y \mid A)$ to $p(y \mid B)$ is

$$K[p(y \mid A), p(y \mid B)] = \int \log[p(y \mid A)/p(y \mid B)]\, p(y \mid A)\, d\nu(y) = E\{\log[p(y \mid A)/p(y \mid B)] \mid A\}.$$

Note that the KLIC distance is directed,

$$K[p(y \mid A), p(y \mid B)] \ne K[p(y \mid B), p(y \mid A)],$$

and one can be finite while the other is infinite. Clearly $K[p(y \mid A), p(y \mid A)] = 0$,
and by Jensen’s inequality for a convex function

$$K[p(y \mid A), p(y \mid B)] = -E\{\log[p(y \mid B)/p(y \mid A)] \mid A\} \ge -\log\{E[p(y \mid B)/p(y \mid A) \mid A]\} = -\log(1) = 0.$$
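The three properties just stated (directedness, zero distance from a density to itself, and nonnegativity) are easy to see in the discrete case. The following sketch assumes two probability vectors on a common finite support; the function name `klic` and the example vectors are illustrative only.

```python
import math

def klic(p, q):
    """KLIC distance from p to q for discrete distributions on a common support:
    sum_y p(y) * log[p(y) / q(y)], with the convention 0 * log(0 / q) = 0."""
    total = 0.0
    for py, qy in zip(p, q):
        if py == 0.0:
            continue                # points with p(y) = 0 contribute nothing
        if qy == 0.0:
            return math.inf         # p puts mass where q puts none
        total += py * math.log(py / qy)
    return total

A = [0.5, 0.5, 0.0]
B = [0.8, 0.1, 0.1]

d_AB = klic(A, B)   # finite and positive
d_BA = klic(B, A)   # infinite: B has mass on the third point, A does not
d_AA = klic(A, A)   # zero
print(d_AB, d_BA, d_AA)
```

The pair A, B was chosen so that one direction is finite while the other is infinite, which is the asymmetry noted in the definition.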


Condition 3 of Theorem 3.4.1 may now be restated, for the case of i.i.d.
distributions, as

$$E[\log p(\tilde{y} \mid \theta_{Aj}, A) \mid D] > E[\log p(\tilde{y} \mid \theta_{Ai}, A) \mid D]$$
$$\iff \int \log p(y \mid \theta_{Aj}, A)\, p(y \mid D)\, d\nu(y) > \int \log p(y \mid \theta_{Ai}, A)\, p(y \mid D)\, d\nu(y)$$
$$\iff K[p(y \mid D), p(y \mid \theta_{Aj}, A)] < K[p(y \mid D), p(y \mid \theta_{Ai}, A)]$$

for each $i \ne j$. More succinctly, as sample size increases, the posterior distribution
places all probability on the parameter vector $\theta_{Aj}$ that provides the smallest KLIC
distance from the data-generating density $p(y \mid D)$ to the model density
$p(y \mid \theta_{Ai}, A)$ $(i = 1, \ldots, n)$. Conclusion (3.29) of Theorem 3.4.1 is often summarized
by saying that $\theta_{Aj}$ is the pseudotrue value of $\theta_A$.


Example 3.4.1 Asymptotic Concentration in the Bernoulli Model with Discrete
Parameter Suppose that in model A, $y_t$ is an i.i.d. Bernoulli random variable
with $P(y_t = 1 \mid p, A) = p$. The prior distribution places positive probability on
only the three points $p = p_1 = \tfrac{1}{4}$, $p = p_2 = \tfrac{1}{2}$, and $p = p_3 = \tfrac{3}{4}$. Suppose that
in the data-generating process D, $\tilde{y}_t$ is an i.i.d. Bernoulli random variable with
$P(\tilde{y}_t = 1 \mid D) = p^*$, and $p^* \in (0, 1)$. Then

$$E[\log p(\tilde{y}_t \mid p_j, A) \mid D] = p^* \log p_j + (1 - p^*) \log(1 - p_j).$$

One can show that $P(p = 1/2 \mid \tilde{Y}_T, A) \xrightarrow{a.s.} 1$ if and only if
$p^* \in (0.36907, 0.63093)$.
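The interval in Example 3.4.1 follows from comparing the limiting log likelihood at $p = 1/2$ with its value at the two neighboring support points: $p = 1/2$ beats $p = 1/4$ exactly when $p^* > \log(3/2)/\log 3$, and by symmetry it beats $p = 3/4$ when $p^*$ is below one minus that value. A short check, with function names chosen here for illustration:

```python
import math

POINTS = (0.25, 0.50, 0.75)   # support of the discrete prior in Example 3.4.1

def limit_loglik(p, p_star):
    """E[log p(y_t | p, A) | D] for Bernoulli data with P(y_t = 1 | D) = p_star."""
    return p_star * math.log(p) + (1.0 - p_star) * math.log(1.0 - p)

def pseudotrue(p_star, points=POINTS):
    """Support point that maximizes the limiting log likelihood."""
    return max(points, key=lambda p: limit_loglik(p, p_star))

# Boundary where p = 1/2 and p = 1/4 tie:
#   -log 2 = -p_star * log 3 + log(3/4)  =>  p_star = log(3/2) / log 3.
lower = math.log(1.5) / math.log(3.0)
upper = 1.0 - lower
print(round(lower, 5), round(upper, 5))
print([pseudotrue(ps) for ps in (0.30, 0.37, 0.50, 0.63, 0.70)])
```

Values of $p^*$ outside the interval send the posterior to $1/4$ or $3/4$, whichever is closer in the KLIC sense, even though neither equals $p^*$.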


When the unobservables $\theta_A$ are continuously distributed, the posterior
probability attached to any single point is always zero, for each $\theta_A \in \Theta_A$ and for all T,
and consequently this is true for each $\theta_A \in \Theta_A$ in the limit as well. In this case it
is useful to frame asymptotic analysis in terms of limiting probabilities of a
neighborhood of a point $\theta_A^*$ with the distinguishing features indicated in the following
result.


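Theorem 3.4.2 Asymptotic Concentration of the Posterior Distribution for a
Continuous Parameter Vector Suppose that


1. The prior distribution of $\theta_A$ is absolutely continuous and $P(\theta_A \in C \mid A) > 0$
for all $C \subseteq \Theta_A$ for which $\int_C d\theta_A > 0$.

2. $T^{-1} \log p(\tilde{Y}_T \mid \theta_A, A) \xrightarrow{a.s.} \ell(\theta_A; A)$ uniformly for all $\theta_A \in \Theta_A$.

3. $\ell(\theta_A; A)$ is a continuous function of $\theta_A$ with a unique global mode at
$\theta_A = \theta_A^*$, and there exist $\underline{\ell}$ and $\overline{\ell}$ for which
$M(\theta_A^*) = \{\theta_A : \underline{\ell} < \ell(\theta_A) < \overline{\ell}\}$ is a
bounded open neighborhood of $\theta_A^*$.


Then for any open neighborhood $N(\theta_A^*)$ of $\theta_A^*$,

$$\lim_{T \to \infty} P[\theta_A \in N(\theta_A^*) \mid \tilde{Y}_T, A] = 1 \quad a.s. \tag{3.30}$$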


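Proof: Define $G(\theta_A^*) = N(\theta_A^*) \cap M(\theta_A^*)$. Let $\bar{G}(\theta_A^*) = \Theta_A - G(\theta_A^*)$,
$\ell_0 = \ell(\theta_A^*; A)$, and $\ell_1 = \sup_{\theta_A \in \bar{G}(\theta_A^*)} \ell(\theta_A; A)$. By virtue of condition 3, $\ell_0 > \ell_1$. Let
$\ell_2 = (\ell_0 + \ell_1)/2$ and define

$$H(\theta_A^*) = G(\theta_A^*) \cap \{\theta_A : \ell(\theta_A) > \ell_2\}.$$

Then

$$\frac{P[\theta_A \notin N(\theta_A^*) \mid \tilde{Y}_T, A]}{P[\theta_A \in N(\theta_A^*) \mid \tilde{Y}_T, A]} \le \frac{P[\theta_A \in \bar{G}(\theta_A^*) \mid \tilde{Y}_T, A]}{P[\theta_A \in H(\theta_A^*) \mid \tilde{Y}_T, A]} \le \frac{P[\theta_A \in \bar{G}(\theta_A^*) \mid A]\, \sup_{\theta_A \in \bar{G}(\theta_A^*)} p(\tilde{Y}_T \mid \theta_A, A)}{P[\theta_A \in H(\theta_A^*) \mid A]\, \inf_{\theta_A \in H(\theta_A^*)} p(\tilde{Y}_T \mid \theta_A, A)}. \tag{3.31}$$

From the continuity of $\ell$ and condition 1, $P[\theta_A \in H(\theta_A^*) \mid A] > 0$. Then, from the
almost sure uniform convergence of $T^{-1} \log p(\tilde{Y}_T \mid \theta_A, A)$,

$$\lim_{T \to \infty} T^{-1} \log\left[\frac{\sup_{\theta_A \in \bar{G}(\theta_A^*)} p(\tilde{Y}_T \mid \theta_A, A)}{\inf_{\theta_A \in H(\theta_A^*)} p(\tilde{Y}_T \mid \theta_A, A)}\right] \le \ell_1 - \ell_2 < 0 \quad a.s.$$

Consequently the almost sure limit of (3.31) is 0. ∎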


Condition (3.30) of this theorem is often summarized by saying that $\theta_A^*$ is the
pseudotrue value of $\theta_A$. If $y_t$ is i.i.d. in both D and A, then

$$K[p(y \mid D), p(y \mid \theta_A^*, A)] < K[p(y \mid D), p(y \mid \theta_A, A)]$$

for all $\theta_A \in \Theta_A$ except $\theta_A^*$. Condition 2 is key in applying Theorem 3.4.2. This
condition is seldom satisfied for a natural parameter space $\Theta_A$. Instead, the
parameter space must be further restricted to a closed and bounded subset $\Theta_A^*$ of $\Theta_A$.
Then, showing that $T^{-1} \log p(\tilde{Y}_T \mid \theta_A, A) \xrightarrow{a.s.} \ell(\theta_A; A)$ for all $\theta_A \in \Theta_A^*$ is
sufficient for condition 2. (Condition 3 requires that $\theta_A^*$ be an interior point of $\Theta_A^*$.)
Non-Bayesian approaches to consistency of point estimates encounter a similar
technical complication; see, for example, Amemiya (1985), Theorems 4.2.1 and
4.2.2.


Example 3.4.2 Asymptotic Concentration in the Bernoulli Model with
Continuous Parameter In the i.i.d. Bernoulli setting of Example 3.4.1, but with a
continuous prior distribution, we have

$$T^{-1} \log p(\tilde{Y}_T \mid p, A) \xrightarrow{a.s.} p^* \log(p) + (1 - p^*) \log(1 - p),$$

where $p^*$ is the value of p in the i.i.d. Bernoulli data-generating process D. If
the prior distribution has support $(p_1, p_2)$ with $0 < p_1 < p^* < p_2 < 1$, then the
conditions of Theorem 3.4.2 are satisfied and
$P[p \in (p^* - \varepsilon, p^* + \varepsilon) \mid \tilde{Y}_T, A] \xrightarrow{a.s.} 1$ for all $\varepsilon > 0$.
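The concentration described in Example 3.4.2 can be illustrated numerically. The sketch below makes several assumptions not in the text: a uniform prior on $(p_1, p_2) = (0.1, 0.9)$, $p^* = 0.3$, $\varepsilon = 0.05$, a stylized data set in which the number of successes is exactly `round(p_star * T)` rather than a random draw, and a midpoint-rule grid in place of exact integration.

```python
import math

def neighborhood_mass(T, p_star=0.3, eps=0.05, p1=0.1, p2=0.9, n_grid=4000):
    """Approximate P[p in (p_star - eps, p_star + eps) | Y_T, A] under a uniform
    prior on (p1, p2), with the number of successes fixed at round(p_star * T)."""
    s = round(p_star * T)
    step = (p2 - p1) / n_grid
    grid = [p1 + (i + 0.5) * step for i in range(n_grid)]       # midpoint rule
    loglik = [s * math.log(p) + (T - s) * math.log(1.0 - p) for p in grid]
    m = max(loglik)                                             # avoid underflow
    weights = [math.exp(v - m) for v in loglik]
    inside = sum(w for p, w in zip(grid, weights) if abs(p - p_star) < eps)
    return inside / sum(weights)

masses = {T: neighborhood_mass(T) for T in (10, 100, 1000, 10000)}
for T, mass in masses.items():
    print(T, round(mass, 4))
```

Subtracting the maximum log likelihood before exponentiating keeps the weights representable for large T; the posterior mass near $p^*$ then rises toward one as T grows.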


Example 3.4.3 Asymptotic Concentration in the Normal Linear Regression
Model Suppose that, as in Example 2.1.2, the conditional distribution of
observables specified by model A is $y \mid (h, \beta, X, A) \sim N(X\beta, h^{-1} I_T)$. Suppose that in the
data-generating process D

$$T^{-1} \begin{bmatrix} \tilde{y}'\tilde{y} & \tilde{y}'\tilde{X} \\ \tilde{X}'\tilde{y} & \tilde{X}'\tilde{X} \end{bmatrix} \xrightarrow{a.s.} \begin{bmatrix} \sigma_{yy} & \sigma_{yx} \\ \sigma_{xy} & \Sigma_{xx} \end{bmatrix}, \tag{3.32}$$

a positive definite matrix. D is otherwise left unspecified, and it is not necessarily
the case that the distribution of $y \mid (X, D)$ is normal, or that the observables $y_t$ are
independent conditional on X. Then

$$\log p(y \mid \beta, h, X, A) = [-T \log(2\pi) + T \log h - h(y'y - 2y'X\beta + \beta'X'X\beta)]/2$$

and

$$T^{-1} \log p(\tilde{y} \mid \beta, h, \tilde{X}, A) \xrightarrow{a.s.} [-\log(2\pi) + \log h - h(\sigma_{yy} - 2\sigma_{yx}\beta + \beta'\Sigma_{xx}\beta)]/2.$$

The unique global maximum of the latter function occurs at the point
$\beta^* = \Sigma_{xx}^{-1}\sigma_{xy}$, $h^* = (\sigma_{yy} - \sigma_{yx}\Sigma_{xx}^{-1}\sigma_{xy})^{-1}$. Condition 1 of Theorem 3.4.2 is satisfied if
the prior distribution of $\beta$ and h is absolutely continuous, condition 2 is satisfied
if the support of $p(\beta, h \mid A)$ bounds h and $\beta'\beta$ from above, and condition 3 is
satisfied if this support includes the point $(\beta^*, h^*)$.


Given further regularity conditions, the posterior distribution, appropriately 
scaled, converges in distribution to a normal distribution as sample size T increases. 
The regularity conditions in the following result, due to Chen (1985), are simi- 
lar to typical conditions for the asymptotic sampling-theoretic distribution of the 
maximum likelihood estimator; for example, see Amemiya (1985), Theorem 4.2.4. 
In fact, these regularity conditions can be weakened, to include cases involving 
nonstationary time series in which the limiting sampling theoretic distributions of 
maximum likelihood estimators are not normal; see Sweeting and Adekola (1987) 
for one such development and Sims and Uhlig (1991) for a significant application 
in time series econometrics. 


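Theorem 3.4.3 Asymptotic Posterior Distribution for a Continuous
Parameter Vector Suppose that for all T, $\log p(\theta_A \mid Y_T, A)$ is twice differentiable
for all $\theta_A \in \Theta_A$ and $Y_T \in \Psi_T$. Denote $L_T(\theta_A) = \log p(\theta_A \mid Y_T, A)$,
$L_T'(\theta_A) = \partial L_T(\theta_A)/\partial\theta_A$ and $L_T''(\theta_A) = \partial^2 L_T(\theta_A)/\partial\theta_A \partial\theta_A'$. Suppose that with probability 1
there exists finite $T^*$ such that


1. For all $T > T^*$, $L_T(\theta_A)$ has a strict local maximum at $\theta_A = \theta_{AT}^*$, at which
point $L_T'(\theta_{AT}^*) = 0$ and $L_T''(\theta_{AT}^*)$ is a negative definite matrix.

2. For all $T > T^*$, the largest eigenvalue $\bar{\sigma}_T^2$ of the positive definite matrix
$\Sigma_T = -[L_T''(\theta_{AT}^*)]^{-1}$ satisfies the condition $\lim_{T \to \infty} \bar{\sigma}_T^2 = 0$.

3. Given any $\varepsilon > 0$ there exist $T(\varepsilon)$ and $\delta(\varepsilon) > 0$ such that for all $T > T(\varepsilon)$
and $\theta_A \in \{\theta_A : (\theta_A - \theta_{AT}^*)'(\theta_A - \theta_{AT}^*) < \delta(\varepsilon)\}$,

$$I_{k_A} - B(\varepsilon) \le L_T''(\theta_A)[L_T''(\theta_{AT}^*)]^{-1} \le I_{k_A} + B(\varepsilon),$$

where $B(\varepsilon)$ is a positive semidefinite matrix whose largest eigenvalue tends
to zero as $\varepsilon \to 0$.

4. There exists a point $\theta_A^* \in \Theta_A$ such that for any open neighborhood $N(\theta_A^*)$
of $\theta_A^*$,

$$\lim_{T \to \infty} P[\theta_A \in N(\theta_A^*) \mid \tilde{Y}_T, A] = 1.$$


For all $T > T^*$, let $\Sigma_T^{1/2}$ be any $k_A \times k_A$ matrix such that
$\Sigma_T^{1/2}(\Sigma_T^{1/2})' = \Sigma_T$, let $\tilde{\theta}_{AT}$ be the random vector corresponding to the pdf $p(\theta_A \mid Y_T, A)$, and
let $\tilde{Z}_T = \Sigma_T^{-1/2}(\tilde{\theta}_{AT} - \theta_{AT}^*)$. Then $\tilde{Z}_T \xrightarrow{d} N(0, I_{k_A})$.

Proof: See Chen (1985). ∎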


Condition 4 of this theorem is the conclusion of Theorem 3.4.2. Conditions 1, 
2, and 3 are more primitive and are often easier to verify; see Exercises 3.4.1(c) 
and 3.4.2(c). Theorem 3.4.3 does not play as vital a role in Bayesian analysis as 
do central limit theorems in non-Bayesian approaches. In non-Bayesian approaches 
central limit theorems provide the basis for approximate inference when, as is nearly 
always the case, the relevant exact sampling-theoretic distributions are unknown. 
In Bayesian analysis, the exact posterior distribution is known, in principle, in any 
complete model. Chapter 4 shows how posterior simulators can be used to reveal 
the posterior distribution. Theorem 3.4.3 provides the conditions under which this
distribution will be approximately normal, and Theorem 3.4.2 provides an
interpretation of the unobservables $\theta_A$ on which the support of the posterior distribution
is concentrated.


Exercise 3.4.1 Asymptotic Analysis of the Exponential Observables
Distribution In an investigator’s model A, $y_t$ is i.i.d. and
$p(y_t \mid \theta, A) = \theta \exp(-\theta y_t) I_{(0,\infty)}(y_t)$. In the data-generating process D, $\tilde{y}_t$ is i.i.d., $P(\tilde{y}_t > 0 \mid D) = 1$, and
$E(\tilde{y}_t \mid D) = \mu < \infty$. The investigator undertakes Bayesian inference using an
i.i.d. sample $Y_T = \{y_t\ (t = 1, \ldots, T)\}$ drawn from a population with density
$p(y \mid D)$.


(a) Show that $T^{-1} \log p(\tilde{Y}_T \mid \theta, A) \xrightarrow{a.s.} \ell(\theta; A)$, and
$\theta^* = \arg\max_\theta \ell(\theta; A) = \mu^{-1}$.

(b) Provide further conditions sufficient for $\theta^*$ to be the pseudotrue value of $\theta$.

(c) Show that given the conditions in (b), conditions 1, 2, and 3 of Theorem
3.4.3 are also satisfied.


Exercise 3.4.2 Asymptotic Analysis of Omitted Covariates An investigator’s
model A for observables is the normal linear regression model

$$y_t = \beta' x_t + \varepsilon_t, \qquad \varepsilon_t \overset{iid}{\sim} N(0, h^{-1}).$$

In the data-generating process D, $y_t = \gamma' x_t + \delta' z_t + \eta_t$, $\eta_t \overset{iid}{\sim} N(0, j^{-1})$ and


where $Q^*$ is positive definite.

(a) Show that

$$T^{-1} \log p(\tilde{Y}_T \mid \beta, h, X, A) \xrightarrow{a.s.} \ell(\beta, h; \gamma, \delta, j, Q^*, A).$$

Find $\arg\max_{\beta, h} \ell(\beta, h; \gamma, \delta, j, Q^*, A)$.

(b) Provide further conditions sufficient for $(\beta^*, h^*)$ to be the pseudotrue value
of $(\beta, h)$.

(c) Show that given the conditions in (b), conditions 1, 2, and 3 of Theorem
3.4.3 are also satisfied.


3.5 THE LIKELIHOOD PRINCIPLE 


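Suppose $\omega = g(\theta_A)$. Then the posterior moment $E[g(\theta_A) \mid y^o, A]$ can be expressed

$$E[\omega \mid y^o, A] = E[g(\theta_A) \mid y^o, A] = \int_{\Theta_A} g(\theta_A)\, p(\theta_A \mid y^o, A)\, d\nu(\theta_A) = \frac{\int_{\Theta_A} g(\theta_A)\, L(\theta_A; y^o, A)\, p(\theta_A \mid A)\, d\nu(\theta_A)}{\int_{\Theta_A} L(\theta_A; y^o, A)\, p(\theta_A \mid A)\, d\nu(\theta_A)}.$$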

Consequently, in forming posterior moments, we never have recourse to the data
beyond $L(\theta_A; y^o, A)$. All information in the data about $g(\theta_A)$ is conveyed through
the likelihood function. This result is a consequence of posterior conditioning. It can
also be obtained from a different set of first principles, developed by Barnard (1949) 
and Fisher (1956) and fully exposited by Birnbaum (1962). Berger and Wolpert 
(1988) provide a thorough exposition of the topic and Poirier (1988) provides an 
introduction written specifically for economists. 

The likelihood principle, formally defined below, states that if two data-generat- 
ing mechanisms lead to the same likelihood function, then they convey exactly the 
same evidence about the unknown parameters. This is not a self-evident assertion, 
especially in the context of non-Bayesian statistics. The following example, which 
is generalized at the end of this section, illustrates this point. 


Example 3.5.1 Stopping Rules from Bernoulli Trials. Consider two
coin-flipping experiments involving independent flips of a coin for which $P(\text{heads}) = \theta$.
Let $y_t = 1$ if a head occurs in trial t and $y_t = 0$ if a tail occurs. In experiment 1 the
coin is flipped T times. The number of heads, $h_T$, is a sufficient statistic because

$$p(y_1, \ldots, y_T \mid \theta) = \prod_{t=1}^{T} \theta^{y_t}(1 - \theta)^{1 - y_t} = \theta^{h_T}(1 - \theta)^{T - h_T}.$$

(The support of this distribution is all possible $\{y_t\}_{t=1}^{T}$.) The distribution of $h_T$ itself
is binomial:

$$p(h_T \mid \theta) = \binom{T}{h_T} \theta^{h_T}(1 - \theta)^{T - h_T}.$$

In experiment 2 the coin is flipped until m heads have been observed. The total
number of flips, $T_m$, is a sufficient statistic because

$$p(y_1, \ldots, y_{T_m} \mid \theta) = \prod_{t=1}^{T_m} \theta^{y_t}(1 - \theta)^{1 - y_t} = \theta^{m}(1 - \theta)^{T_m - m}.$$

(The support of this distribution is all possible $\{y_t\}_{t=1}^{T_m}$ for which $\sum_{t=1}^{T_m} y_t = m$ and
$y_{T_m} = 1$.) The distribution of $T_m$ itself is negative binomial:

$$p(T_m \mid \theta) = \binom{T_m - 1}{m - 1} \theta^{m}(1 - \theta)^{T_m - m}.$$

Suppose that the number of heads in the two experiments turns out to be the same
($h_T = m$) and the number of tails also turns out to be the same ($T - h_T = T_m - m$).
The likelihood principle asserts that in this case the conclusions about $\theta$ must be
the same in the two experiments.
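The proportionality of the two likelihood functions in Example 3.5.1 can be checked directly: with the same numbers of heads and tails, the binomial and negative binomial probabilities differ only by a factor that does not involve $\theta$. The sketch below uses illustrative values (12 flips, 3 heads) that are not taken from the text.

```python
from math import comb

def binomial_pmf(h, T, theta):
    """p(h_T | theta): h heads in a fixed number T of flips."""
    return comb(T, h) * theta**h * (1.0 - theta)**(T - h)

def negative_binomial_pmf(T_m, m, theta):
    """p(T_m | theta): the m-th head arrives on flip number T_m."""
    return comb(T_m - 1, m - 1) * theta**m * (1.0 - theta)**(T_m - m)

T, heads = 12, 3                         # illustrative values only
thetas = [0.1, 0.25, 0.5, 0.75, 0.9]
ratios = [binomial_pmf(heads, T, th) / negative_binomial_pmf(T, heads, th)
          for th in thetas]

# The ratio equals comb(T, heads) / comb(T - 1, heads - 1) at every theta,
# so the two likelihood functions are proportional.
constant = comb(T, heads) / comb(T - 1, heads - 1)
print(ratios, constant)
```

Because the constant cancels when a posterior density is normalized, any prior yields the same posterior for $\theta$ in the two experiments, whereas sampling-theoretic procedures that depend on the full sample space can differ.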


The formal development here will assert two basic principles, the weak suffi- 
ciency principle and the weak conditionality principle, and then show that the two 
together are equivalent to the likelihood principle. We may accept or reject either 
of the two basic principles, but if we accept them both, then we also accept the 
likelihood principle. If we reject the likelihood principle, then we also reject the 
weak sufficiency principle, the weak conditionality principle, or both. The formal 
development begins with two definitions. 


Definition 3.5.1 An experiment is characterized by the triplet
$E = \{y, \theta_A, p(y \mid \theta_A, A)\}$.

Definition 3.5.2 The evidence about $\theta_A$ arising from E and the realization
$y = y^o$, denoted $Ev(E, y^o)$, is any inference, conclusion, or report concerning $\theta_A$
based on E and $y = y^o$.

Examples of evidence include point estimates, hypothesis tests, posterior
distributions, and general discussions at the ends of scientific papers using data.


Definition 3.5.3 Weak Sufficiency Principle Suppose that $s(y; A)$ is a
sufficient statistic in the experiment $E = \{y, \theta_A, p(y \mid \theta_A, A)\}$. Let two runs of the
experiment result in realizations $y_1^o$ and $y_2^o$, respectively, and suppose
$s(y_1^o; A) = s(y_2^o; A)$. Then $Ev(E, y_1^o) = Ev(E, y_2^o)$.


The reasonableness of the weak sufficiency principle is inherent in the ex post 
formulation of sufficiency (2.29), which, we have seen, is logically equivalent to 
the ex ante definition (2.28). The development of the weak conditionality principle 
is based on the concept of a mixed experiment. 


Definition 3.5.4 Given two experiments

$$E_j = \{y_j, \theta_A, p(y_j \mid \theta_A, A_j)\} \quad (j = 1, 2)$$

involving the same parameter vector $\theta_A$, a mixed experiment, based on $E_1$ and
$E_2$, is $E_* = [y_*, \theta_A, p(j, y_j \mid \theta_A, A)]$ where the random vector $y_* = (j, y_j)$,
$A = (A_1, A_2)$, and $p(j, y_j \mid \theta_A, A) = .5\, p(y_j \mid \theta_A, A_j)$.


Thus in a mixed experiment $p(j \mid \theta_A, A) = .5$. Which experiment is actually
conducted does not depend on $\theta_A$.


Definition 3.5.5 Weak Conditionality Principle. Consider two experiments,
$E_j = \{y_j, \theta_A, p(y_j \mid \theta_A, A_j)\}$ $(j = 1, 2)$, as well as the mixed experiment
$E_* = [y_*, \theta_A, p(j, y_j \mid \theta_A, A)]$. Then $Ev[E_*, (j, y_j)] = Ev(E_j, y_j)$.


Example 3.5.2 Assessing the Quality of a Scientific Paper The editor of a
scientific journal wishes to learn about the quality, $\theta_A$, of a scientific paper. He
can send the paper to either referee A or referee B, each of whom has expertise in 
different areas relevant to the paper. The editor may decide to send the paper to 
A, to send it to B, or to flip a coin and send it to A if the coin is heads and to B 
if it is tails. The weak conditionality principle asserts that once the report from the 
known referee is in hand, the editor’s findings about the quality of the scientific 
paper will be the same whether he chose the referee deliberately or flipped a coin. 


Definition 3.5.6 Likelihood Principle. Consider the two experiments
$E_j = [y_j, \theta_A, p(y_j \mid \theta_A, A_j)]$ $(j = 1, 2)$. Suppose that for the particular realizations $y_1^o$
and $y_2^o$, the respective likelihood functions satisfy

$$L(\theta_A; y_1^o, A_1) \propto L(\theta_A; y_2^o, A_2).$$

Then $Ev(E_1, y_1^o) = Ev(E_2, y_2^o)$.

Theorem 3.5.1 Equivalence of the Likelihood Principle, and the Weak
Sufficiency and Weak Conditionality Principles The weak sufficiency principle and
the weak conditionality principle are together equivalent to the likelihood principle.


Proof: Assume the likelihood principle. The antecedents of the weak sufficiency
principle are that $s(y; A)$ is a sufficient statistic in the experiment
$E = [y, \theta_A, p(y \mid \theta_A, A)]$ and that $s(y_1^o; A) = s(y_2^o; A)$. By the factorization criterion

$$L(\theta_A; y_i^o, A) = p[s(y_i^o; A) \mid \theta_A, A]\, r(y_i^o; A) \quad (i = 1, 2).$$

Hence $L(\theta_A; y_1^o, A) \propto L(\theta_A; y_2^o, A)$, and by the likelihood principle
$Ev(E, y_1^o) = Ev(E, y_2^o)$, thus establishing the weak sufficiency principle.
The mixed experiment $E_* = [(j, y_j), \theta_A, p(j, y_j \mid \theta_A, A)]$ has likelihood
function

$$L(\theta_A; j^o, y_{j^o}^o, A) \propto p(j^o, y_{j^o}^o \mid \theta_A, A) = .5\, p(y_{j^o}^o \mid \theta_A, A_{j^o}) \propto L(\theta_A; y_{j^o}^o, A_{j^o}).$$

Hence $Ev[E_*, (j, y_j)] = Ev(E_j, y_j)$, thus establishing the weak conditionality
principle.

Taking up the converse, suppose that for the particular realizations $y_1^o$ and
$y_2^o$, the likelihood functions from the two experiments satisfy
$L(\theta_A; y_1^o, A_1) \propto L(\theta_A; y_2^o, A_2)$. The proof proceeds by creating identical sufficient statistics based
on a mixed experiment employing the weak conditionality principle; the weak
sufficiency principle then gives the result:


1. Define the mixed experiment $E_*$ as in the weak conditionality principle,
from which we know $Ev[E_*, (j, y_j)] = Ev[E_j, y_j]$. Let $y_1^o$ and $y_2^o$ be the
two realizations in the antecedent of the likelihood principle.

2. For the mixed experiment $E_*$ with random outcomes $(j, y_j)$, define the
statistic

$$Q(j, y_j) = \begin{cases} (1, y_1^o) & \text{if } j = 2 \text{ and } y_2 = y_2^o \\ (j, y_j) & \text{otherwise.} \end{cases}$$

Thus the outcomes $(1, y_1^o)$ and $(2, y_2^o)$ result in the same values of Q. [The
motivation in this construction is to deliberately blur the outcomes $(1, y_1^o)$
and $(2, y_2^o)$ and then show that we lose nothing.]

3. Note that Q is sufficient for $\theta_A$ from the definition (2.28):

$$P[(j, y_j) \mid Q \ne (1, y_1^o)] = \begin{cases} 1 & \text{if } (j, y_j) = Q \\ 0 & \text{otherwise,} \end{cases}$$

$$P[(1, y_1^o) \mid Q = (1, y_1^o)] = \frac{.5\, p(y_1^o \mid \theta_A, A_1)}{.5 \sum_{j=1}^{2} p(y_j^o \mid \theta_A, A_j)} = \frac{c}{c + 1},$$

$$P[(2, y_2^o) \mid Q = (1, y_1^o)] = \frac{1}{c + 1}.$$

(In the last two equations $c = p(y_1^o \mid \theta_A, A_1)/p(y_2^o \mid \theta_A, A_2)$. The ratio does
not involve $\theta_A$, from the antecedent of the likelihood principle.)

4. Since the sufficient statistic Q is the same regardless of whether the
outcome is $(1, y_1^o)$ or $(2, y_2^o)$, the weak sufficiency principle implies
$Ev[E_*, (1, y_1^o)] = Ev[E_*, (2, y_2^o)]$, and hence, by step 1,
$Ev(E_1, y_1^o) = Ev(E_2, y_2^o)$. ∎


The likelihood principle extends the conclusion of Example 3.5.1 to sequential
experiments and stopping rules generally.


Definition 3.5.7 A sequential experiment is an experiment in which the
stopping time T is random.


Definition 3.5.8 A stopping rule in a sequential experiment is a sequence of
probabilities $\tau = \{\tau_m\}_{m=0}^{\infty}$ in which $\tau_0$ is constant and

$$\tau_m = P(T = m \mid Y_m, \theta_A, A) = P(T = m \mid Y_m, A) = \tau_m(Y_m; A).$$

Note that the stopping probability may depend on the observables $Y_m$, but not on
the unobservables $\theta_A$.


Corollary 3.5.1 Stopping Rule Corollary of the Likelihood Principle In
a sequential experiment $Ev[E, (T, Y_T)]$ depends on $(T, Y_T)$ only through
$L(\theta_A; Y_T, A)$. The likelihood principle implies that the stopping rule $\tau$ is irrelevant.


Proof:

$$p(T, Y_T \mid \theta_A, A) = (1 - \tau_0) \prod_{t=1}^{T-1} [1 - \tau_t(Y_t; A)]\, \tau_T(Y_T; A)\, p(Y_T \mid \theta_A, A).$$

Hence $L(\theta_A; T, Y_T, A) \propto p(Y_T \mid \theta_A, A)$. ∎


Exercise 3.5.1 The Likelihood Principle, Conditioning, and Non-Bayesian
Statistics (I) [From Berger and Wolpert (1988), Example 10, p. 21.] Either of
two experiments, $E_1$ or $E_2$, can be undertaken to learn about a parameter $\theta$. In
each experiment a single random outcome y will be observed. The distribution of
y depends on $\theta$ in each experiment, as follows:

Experiment A
            P(y = 1 | θ)    P(y = 2 | θ)    P(y = 3 | θ)
θ = 0          .900            .050            .050
θ = 1          .090            .055            .855

Experiment B
            P(y = 1 | θ)    P(y = 2 | θ)    P(y = 3 | θ)
θ = 0          .260            .730            .010
θ = 1          .026            .803            .171


(a) If you were able to choose which experiment to carry out, which one
would you choose? Provide as much formal justification for your answer
as you can.

(b) Take the likelihood principle as given, and show that the outcome y = 1
conveys the same evidence about $\theta$ regardless of which experiment is
performed. The same is true for y = 2, and the same is true for y = 3.

(c) Is there a logical conflict between your answers to (a) and (b)? Why or
why not?

(d) Consider a non-Bayesian (classical) test that accepts $\theta = 0$ when y = 1 and
decides $\theta = 1$ otherwise. Show that this is a most powerful test, with error
probabilities (of type I and type II, respectively) .10 and .09 in experiment
A and .74 and .026 in experiment B. (Recall what “most powerful” means
in this context; any other test has either a higher probability of type I error,
or a higher probability of type II error, or both.)

Exercise 3.5.2 The Likelihood Principle, Conditioning, and Non-Bayesian
Statistics (II) [From Berger and Wolpert (1988), Example 1, p. 5.] Suppose that
the random variables $x_t$ $(t = 1, \ldots, T)$ are independent, and
$P(x_t = \theta - 1 \mid \theta) = P(x_t = \theta + 1 \mid \theta) = .5$.


(a) Experiment 1 consists of collecting a sample of size T = 2. Show that a
75% confidence interval of smallest size for $\theta$ is

$$C(x_1, x_2) = \begin{cases} \text{the point } (x_1 + x_2)/2 & \text{if } x_1 \ne x_2 \\ \text{the point } x_1 - 1 & \text{if } x_1 = x_2. \end{cases}$$

[Thus, if used repeatedly, $C(x_1, x_2)$ would contain $\theta$ with probability .75.]
The evidence in experiment 1 is that $C(x_1, x_2)$ constitutes a 75% confidence
interval for $\theta$.

(b) In the context of (a), suppose that you observe $x_1 = 2$ and $x_2 = 4$. Then the
75% confidence interval in (a) is the point 3. Is this consistent with common
sense about the reliability of the conclusion that $\theta = 3$?

(c) In experiment 2 $x_t$ is drawn repeatedly, and the experiment concludes with
the first occurrence of $x_t \ne x_1$; thus the size, T, of the collected sample is
random. The evidence in experiment 2 is that $\theta = (x_1 + x_T)/2$ is a 100%
confidence interval for $\theta$. Is the evidence from experiments 1 and 2 consistent
with the likelihood principle? Provide a formal answer.


CHAPTER 4 


Posterior Simulation 


Bayesian inference requires that we be able to access the posterior distribution
of the vector of interest $\omega$ in one or more models. In all except simple illustrative
cases this cannot be done analytically. This chapter describes algorithms
for simulating a sequence $\{\omega^{(m)}\}$ whose distribution is closely related to the
distribution $\omega \mid (y^o, A)$. The sequence $\{\omega^{(1)}, \ldots, \omega^{(M)}\}$ can be used to approximate
posterior moments of the form $E[h(\omega) \mid y^o, A]$ and Bayes actions of the form
$\hat{a} = \arg\min_{a \in A} E[L(a, \omega) \mid y^o, A]$ arbitrarily well: the larger is $M$, the better is the
approximation. Taken together, these algorithms are known generically as posterior
simulation methods. The simplest possible relation between $\{\omega^{(m)}\}$ and $p(\omega \mid y^o, A)$
is $\omega^{(m)} \overset{iid}{\sim} p(\omega \mid y^o, A)$. In this case it is possible to learn a great deal about the
posterior distribution of $\omega$, as detailed in the next section. In most models it is not
known how to construct such a sequence. Fortunately there are more sophisticated
posterior simulators that typically can be used. One approach is to find a distribution
that is similar to the posterior distribution, and from which i.i.d. drawings can
be made. It is possible to correct for the difference in the simulation and posterior
distributions, in such a way that posterior moments can be approximated arbitrarily
well. Section 4.2 makes clear the sense in which the simulation and posterior
distributions must be similar, and details two kinds of corrections that can be made.
Section 4.4 takes up some variants on these methods that can greatly increase
the amount of information about the posterior distribution in a given number of
simulations.

As the dimension of the space in which simulations are carried out becomes 
large, it is often increasingly difficult to find a single distribution that is sufficiently 
similar to the posterior distribution that this approach is practical. A different class 
of simulation methods, known as Markov chain Monte Carlo (MCMC), constructs 
sequences that are neither independent nor identically distributed, but converge in 
distribution to the posterior distribution. These simulators are more sophisticated, 
and they make the posterior distribution accessible in a very large set of econometric
and statistical models. Section 4.3 provides an informal introduction to these
methods, with a treatment in greater depth in Section 4.5. Section 4.6 takes up
some combinations of these methods that widen the range of problems that can be 
attacked successfully by MCMC methods. Section 4.7 discusses the evaluation of 
the numerical accuracy of MCMC simulators. 

While our primary interest is in simulating the posterior distribution of a function
of interest $h(\omega)$, the methods developed in this chapter can be used for any
distribution. For example, they can be used to explore the implications of a prior
distribution. To reflect this generality in the notation, this chapter takes the canonical
simulation problem to be that of learning about the distribution of $\omega$, where
$\omega \sim p(\omega \mid \theta, I)$ and $\theta \sim p(\theta \mid I)$, where $I$ denotes the distribution of interest. In
this formulation $\theta \in \Theta$, $\omega \in \Omega$, $p(\theta \mid I)$ is any density with respect to a measure
$d\nu(\theta)$, and $p(\omega \mid \theta, I)$ is any conditional density with respect to a measure $d\nu(\omega)$.


4.1 DIRECT SAMPLING 


Suppose that from the probability density $p(\theta, \omega \mid I)$ it is possible to simulate
pairs of independent identically distributed (i.i.d.) drawings $\theta^{(m)} \sim p(\theta \mid I)$ and
$\omega^{(m)} \sim p(\omega \mid \theta^{(m)}, I)$. An example of such a density is the posterior density in
Example 2.3.3, the normal linear regression model with conjugate prior distribution.
In that example the corresponding posterior distribution is represented as

$$\bar{s}^2 h \mid (y^o, X, A) \sim \chi^2(\bar{\nu}), \qquad \beta \mid (h, y^o, X, A) \sim N(\bar{\beta}, h^{-1}\bar{H}^{-1}).$$

The following result shows that it is possible to use the sequence $\{\omega^{(m)}\}$ to approximate
several aspects of the distribution of $\omega$, including moments.


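Theorem 4.1.1 Approximation of Distributions and Moments by Direct
Sampling Suppose that the sequence $\{\theta^{(m)}, \omega^{(m)}\}$ is i.i.d., with $\theta^{(m)} \in \Theta$,
$\omega^{(m)} \in \Omega$, $\theta^{(m)} \sim p(\theta \mid I)$, and $\omega^{(m)} \mid \theta^{(m)} \sim p(\omega \mid \theta^{(m)}, I)$.
Let $h : \Omega \to \mathbb{R}^1$ and consider several additional conditions:

1. $E[h(\omega) \mid I] = \bar{h}$.
2. $\mathrm{var}[h(\omega) \mid I] = \sigma^2$.
3. For given $p \in (0, 1)$, there is a unique $h_p$ such that the statements
   $P[h(\omega) \le h_p \mid I] \ge p$ and $P[h(\omega) \ge h_p \mid I] \ge 1 - p$
   are both true.
4. The pdf $p[h(\omega) \mid I]$ is continuous, and for the unique $h_p$ corresponding to $p$,
   $p[h(\omega) = h_p \mid I] > 0$.

Then

(a) $\omega^{(m)} \overset{iid}{\sim} p(\omega \mid I)$, regardless of whether any of conditions 1-4 are true.
(b) Given condition 1, $\bar{h}^{(M)} = M^{-1} \sum_{m=1}^{M} h(\omega^{(m)}) \xrightarrow{a.s.} \bar{h}$.
(c) Given conditions 1 and 2, $M^{1/2}(\bar{h}^{(M)} - \bar{h}) \xrightarrow{d} N(0, \sigma^2)$ and

$$\hat{\sigma}^{2(M)} = M^{-1} \sum_{m=1}^{M} \left[h(\omega^{(m)}) - \bar{h}^{(M)}\right]^2 \xrightarrow{a.s.} \sigma^2.$$

Let $\hat{h}_p^{(M)}$ be any real number such that

$$M^{-1} \sum_{m=1}^{M} I_{(-\infty, \hat{h}_p^{(M)}]}[h(\omega^{(m)})] \ge p \quad \text{and} \quad M^{-1} \sum_{m=1}^{M} I_{[\hat{h}_p^{(M)}, \infty)}[h(\omega^{(m)})] \ge 1 - p.$$

Then

(d) Given condition 3, $\hat{h}_p^{(M)} \xrightarrow{a.s.} h_p$.
(e) Given conditions 3 and 4,

$$M^{1/2}(\hat{h}_p^{(M)} - h_p) \xrightarrow{d} N\{0,\; p(1 - p)/p[h(\omega) = h_p \mid I]^2\}.$$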


Proof: Conclusion (a) is just a consequence of the definitions of conditional,
joint, and marginal probability. Conclusion (b) follows from the strong law of
large numbers [see Casella and Berger (2002), Theorem 5.5.9] and (c), from the
Lindeberg-Lévy central limit theorem (Casella and Berger 2002, Theorem 5.5.15).
Conclusion (d) is immediate from 6f.2(i) in Rao (1965) and (e), from 6f.2(ii) in
Rao (1965).


In the approximation of any quantity, some assessment of the error is essential.
For the approximation of moments by direct sampling, the relevant assessment is
given by conclusion (c) in Theorem 4.1.1. Given conditions 1 and 2, the approximation
$\bar{h}^{(M)} \sim N(\bar{h}, \hat{\sigma}^{2(M)}/M)$ is valid for large $M$. In this case $(\hat{\sigma}^{2(M)}/M)^{1/2}$ is
known as the numerical standard error (NSE) of $\bar{h}^{(M)}$.
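
The quantities in Theorem 4.1.1 are simple to compute from a set of i.i.d. draws. The following is a minimal sketch, not part of the original text: the function name `direct_sampling_summary` and the illustrative distribution $N(1, 2^2)$ for $h(\omega)$ are assumptions made only for the example. It returns the sample mean $\bar{h}^{(M)}$, its NSE from conclusion (c), and an order statistic satisfying the two inequalities that define $\hat{h}_p^{(M)}$.

```python
import math
import random

def direct_sampling_summary(draws, p=0.5):
    """Return (mean, NSE, sample p-quantile) for i.i.d. draws of h(omega)."""
    M = len(draws)
    h_bar = sum(draws) / M
    # Variance estimate with divisor M, as in conclusion (c)
    sigma2_hat = sum((x - h_bar) ** 2 for x in draws) / M
    nse = math.sqrt(sigma2_hat / M)
    # Order statistic with at least a fraction p of the draws at or below it
    # and at least a fraction 1 - p at or above it
    ordered = sorted(draws)
    k = max(math.ceil(p * M), 1)
    return h_bar, nse, ordered[k - 1]

rng = random.Random(0)
# Hypothetical example: h(omega) = omega with omega ~ N(1, 2^2)
draws = [rng.gauss(1.0, 2.0) for _ in range(10_000)]
h_bar, nse, median = direct_sampling_summary(draws)
print(h_bar, nse, median)
```

With $M = 10{,}000$ and $\sigma = 2$ the NSE should be close to $2/\sqrt{10{,}000} = 0.02$; quadrupling $M$ halves it.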


Example 4.1.1 Direct Sampling in the Normal Linear Regression Model with
Conjugate Prior Distribution In the context of Example 2.3.3 suppose that the
vector of interest $\omega$ is the unobserved outcome $y$ corresponding to a new experiment
in which $x = x^*$; thus $\omega \mid (x^*, \beta, h, A) \sim N(\beta' x^*, h^{-1})$. If

$$\bar{s}^2 h^{(m)} \sim \chi^2(\bar{\nu}),$$
$$\beta^{(m)} \mid h^{(m)} \sim N[\bar{\beta}, (h^{(m)} \bar{H})^{-1}],$$
$$\omega^{(m)} \mid (\beta^{(m)}, h^{(m)}) \sim N(\beta^{(m)\prime} x^*, h^{(m)-1}),$$

then $\omega^{(m)} \overset{iid}{\sim} p(\omega \mid x^*, y^o, X, A)$, where $\bar{\beta}$ and $\bar{H}$ are as defined in (2.52) and $\bar{s}^2$
and $\bar{\nu}$ are as defined in (2.56).

From the sample $\{\omega^{(1)}, \ldots, \omega^{(M)}\}$ we may approximate

$$p = P(\omega \le a \mid x^*, y^o, X, A) = E[I_{(-\infty, a]}(\omega) \mid x^*, y^o, X, A] \qquad (4.1)$$

by $\hat{p}^{(M)} = M^{-1} \sum_{m=1}^{M} I_{(-\infty, a]}(\omega^{(m)})$. The approximate variance of this numerical
approximation to $p$ is $p(1 - p)/M$, and this variance may in turn be approximated
by $\hat{p}^{(M)}(1 - \hat{p}^{(M)})/M$. (For an improvement on this approximation, see
Exercise 4.4.1.)

Suppose we were to estimate $\omega$. If the loss function is quadratic, then
$\hat{\omega} = E(\omega \mid x^*, y^o, X, A)$ can be approximated by $M^{-1} \sum_{m=1}^{M} \omega^{(m)}$. If the loss function
is $L(\hat{\omega}, \omega) = (1 - q)(\hat{\omega} - \omega) I_{(-\infty, \hat{\omega}]}(\omega) + q(\omega - \hat{\omega}) I_{(\hat{\omega}, \infty)}(\omega)$, then $\hat{\omega}$ is quantile
$q$ of the $\omega \mid (x^*, y^o, X, A)$ distribution: $P(\omega \le \hat{\omega} \mid x^*, y^o, X, A) = q$. The estimate
$\hat{\omega}$ can be approximated by the corresponding quantile of $\{\omega^{(m)}\}_{m=1}^{M}$: the values $\omega^{(m)}$
will in general all be different, and we choose that $\omega^{(m')}$ such that the fraction of
$\omega^{(m)}$ less than or equal to $\omega^{(m')}$ is at least $q$ and the fraction of $\omega^{(m)}$ greater than
or equal to $\omega^{(m')}$ is at least $1 - q$.
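
The composition in Example 4.1.1 can be sketched as follows. This is an illustration under stated assumptions rather than the book's code: $\beta$ is taken to be scalar so that no matrix factorization is needed, and the posterior hyperparameters passed to the hypothetical function `simulate_predictive` are invented for the example. The chi-square draw uses the fact that a $\chi^2(\nu)$ variate is a gamma variate with shape $\nu/2$ and scale 2.

```python
import math
import random

def simulate_predictive(M, beta_bar, H_bar, s2_bar, nu_bar, x_star, rng):
    """Draw omega^(m) from the predictive density by composition (scalar beta)."""
    draws = []
    for _ in range(M):
        # s2_bar * h ~ chi-square(nu_bar)
        h = rng.gammavariate(nu_bar / 2.0, 2.0) / s2_bar
        # beta | h ~ N(beta_bar, (h * H_bar)^{-1})
        beta = rng.gauss(beta_bar, 1.0 / math.sqrt(h * H_bar))
        # omega | beta, h ~ N(beta * x_star, h^{-1})
        draws.append(rng.gauss(beta * x_star, 1.0 / math.sqrt(h)))
    return draws

def prob_and_quantile(draws, a, q):
    """Approximate P(omega <= a), its NSE, and quantile q of the draws."""
    M = len(draws)
    p_hat = sum(1 for w in draws if w <= a) / M
    nse = math.sqrt(p_hat * (1.0 - p_hat) / M)
    ordered = sorted(draws)
    omega_q = ordered[max(math.ceil(q * M), 1) - 1]
    return p_hat, nse, omega_q

rng = random.Random(1)
# Hypothetical posterior hyperparameters, chosen only for illustration
draws = simulate_predictive(20_000, beta_bar=2.0, H_bar=50.0,
                            s2_bar=40.0, nu_bar=40.0, x_star=1.5, rng=rng)
p_hat, nse, omega_q = prob_and_quantile(draws, a=3.0, q=0.9)
print(p_hat, nse, omega_q)
```

Under quadratic loss the estimate is the sample mean of `draws`; under the linear-linear loss above it is `omega_q`.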


Theorem 4.1.1 can be used to solve Bayesian decision and estimation problems
for specific loss functions. If the loss function is quadratic (Definition 2.4.3) or
linear-exponential (Exercise 2.4.3), the Bayes action is a posterior moment and
conclusions (b) and (c) are relevant. If the loss function is linear-linear (Definition
2.4.4), then conclusions (d) and (e) are relevant. In general, however, Bayesian
decision problems need not reduce to posterior moments or quantiles. The following
result applies when the loss function $L(a, \omega)$ is a smooth function of the Bayes
action $a$.


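Theorem 4.1.2 Approximation of Bayes Actions by Direct Sampling Suppose
that the sequence $\{\theta^{(m)}, \omega^{(m)}\}$ is i.i.d., with $\theta^{(m)} \sim p(\theta \mid I)$ and
$\omega^{(m)} \mid (\theta^{(m)}, I) \sim p(\omega \mid \theta^{(m)}, I)$. Let $L(a, \omega) \ge 0$ be a loss function defined on $A \times \Omega$,
where $A$ is an open subset of $\mathbb{R}^m$. Suppose that the risk function

$$R(a) = \int_{\Omega} \int_{\Theta} L(a, \omega)\, p(\theta \mid I)\, p(\omega \mid \theta, I)\, d\theta\, d\omega$$

has a strict global minimum at $\hat{a} \in A \subseteq \mathbb{R}^m$. Consider several additional conditions,
for a suitably defined open neighborhood of $\hat{a}$, $N(\hat{a})$:

1. $M^{-1} \sum_{m=1}^{M} L(a, \omega^{(m)})$ converges uniformly to $R(a)$ for all $a \in N(\hat{a})$, almost
   surely.
2. $\partial L(a, \omega)/\partial a$ exists and is a continuous function of $a$, for all $\omega \in \Omega$ and all
   $a \in N(\hat{a})$.
3. $\partial^2 L(a, \omega)/\partial a\, \partial a'$ exists and is a continuous function of $a$, for all $\omega \in \Omega$ and
   all $a \in N(\hat{a})$.
4. $B = \mathrm{var}[\partial L(a, \omega)/\partial a|_{a=\hat{a}} \mid I]$ exists and is finite.
5. $H = E[\partial^2 L(a, \omega)/\partial a\, \partial a'|_{a=\hat{a}} \mid I]$ exists and is finite and nonsingular.
6. For any $\varepsilon > 0$, there exists $M_\varepsilon$ such that

$$P\Big[\sup_{a \in N(\hat{a})} |\partial^3 L(a, \omega)/\partial a_i\, \partial a_j\, \partial a_k| < M_\varepsilon \,\Big|\, I\Big] \ge 1 - \varepsilon$$

   for all $i, j, k = 1, \ldots, m$.

Let $\hat{A}_M$ be the set of all roots of $M^{-1} \sum_{m=1}^{M} \partial L(a, \omega^{(m)})/\partial a = 0$. Then

(a) Given conditions 1 and 2, for any $\varepsilon > 0$

$$\lim_{M \to \infty} P\Big[\inf_{a \in \hat{A}_M} (a - \hat{a})'(a - \hat{a}) > \varepsilon \,\Big|\, I\Big] = 0. \qquad (4.2)$$

(b) Given conditions 1-6, if $\hat{a}_M$ is any element of $\hat{A}_M$ such that $\hat{a}_M \xrightarrow{a.s.} \hat{a}$, then
   (i) $M^{1/2}(\hat{a}_M - \hat{a}) \xrightarrow{d} N(0, H^{-1} B H^{-1})$.
   (ii) $M^{-1} \sum_{m=1}^{M} \partial L(a, \omega^{(m)})/\partial a|_{a=\hat{a}_M} \cdot \partial L(a, \omega^{(m)})/\partial a'|_{a=\hat{a}_M} \xrightarrow{a.s.} B$.
   (iii) $M^{-1} \sum_{m=1}^{M} \partial^2 L(a, \omega^{(m)})/\partial a\, \partial a'|_{a=\hat{a}_M} \xrightarrow{a.s.} H$.

Proof: Result (a) follows from Amemiya (1985), Theorem 4.1.2. Result
(b) follows from Amemiya (1985), Theorems 4.1.3 and 4.1.4.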


Theorem 4.1.2 is widely and readily applicable. The conditions can be verified
directly in most cases. Beyond the posterior simulator the computations require
coding of the loss function and its first two derivatives. Once this is accomplished,
conventional and widely available optimization software can be applied directly to
the function $R_M(a) = M^{-1} \sum_{m=1}^{M} L(a, \omega^{(m)})$ of $a$. We can show that $R_M(\hat{a}_M) \xrightarrow{a.s.} R(\hat{a})$,
and in fact [see Shao (1989)] $R_M(\hat{a}_M) - R_M(\hat{a}) = O(M^{-1} \log \log M)$.
The caveats generally applicable to numerical optimization are relevant here. Unless
$L(a, \omega)$ is known to be convex in $a$, a local minimum need not be a global one. The
result is the posterior simulation approximation $\hat{a}_M$ of the Bayes action $\hat{a}$. The usual
advice, to iterate to a minimum from alternative starting values, applies here, but
in posterior simulation we also have the alternative of solving the problem with a
larger sample size $M$. Result (b) provides numerical standard errors for assessing
the accuracy of the approximation that are valid for large $M$.
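
As an illustration of the recipe in the preceding paragraph, the sketch below minimizes the simulated risk $R_M(a)$ for a scalar action under a linear-exponential loss. The specific form $L(a, \omega) = \exp[c(a - \omega)] - c(a - \omega) - 1$, the function name `bayes_action_linex`, and the use of Newton's method are assumptions made for this example; the text only requires that the loss and its first two derivatives be coded. The NSE uses the sample analogs of $B$ and $H$ from result (b).

```python
import math
import random

def bayes_action_linex(omegas, c=1.0, tol=1e-12, max_iter=100):
    """Minimize M^{-1} sum L(a, omega^(m)) for an assumed linex loss; return (a_M, NSE)."""
    M = len(omegas)
    a = sum(omegas) / M  # starting value
    for _ in range(max_iter):
        e = [math.exp(c * (a - w)) for w in omegas]
        grad = sum(c * (ei - 1.0) for ei in e) / M   # average of dL/da
        hess = sum(c * c * ei for ei in e) / M       # average of d2L/da2
        step = grad / hess
        a -= step                                    # Newton update
        if abs(step) < tol:
            break
    e = [math.exp(c * (a - w)) for w in omegas]
    B_hat = sum((c * (ei - 1.0)) ** 2 for ei in e) / M   # result (b)(ii)
    H_hat = sum(c * c * ei for ei in e) / M              # result (b)(iii)
    nse = math.sqrt(B_hat / (H_hat * H_hat) / M)         # from (b)(i)
    return a, nse

rng = random.Random(2)
# Hypothetical posterior: omega ~ N(0, 1), for which the exact action is -c/2
omegas = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
a_M, nse = bayes_action_linex(omegas, c=1.0)
print(a_M, nse)
```

For this loss the first-order condition has the closed-form solution $\hat{a}_M = -c^{-1} \log[M^{-1} \sum_m \exp(-c\,\omega^{(m)})]$, which gives a direct check on the optimizer.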


Exercise 4.1.1 The Inverse CDF Method This method applies, in principle, to
any random variable for which it is possible to compute the inverse of the cumulative
distribution. It is generally limited to univariate distributions. Its efficiency,
relative to other methods described in the next section, depends on the time required
to compute the inverse cdf.

(a) Suppose that a random variable has cdf $F(x)$ and $F(x) = p$ if and only
if $x = F^{-1}(p)$. Show that if $u$ is uniformly distributed on $[0, 1]$, then
$x = F^{-1}(u)$ has cdf $F(x)$.

(b) How would you use the inverse cdf method to simulate an exponentially
distributed random variable with mean $\mu$?


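Exercise 4.1.2 Simulation from the Bivariate Normal Distribution Suppose
that

$$f(x, y) = \begin{cases} (x^2 + y^2)^{1/2} & \text{if } x^2 + y^2 < 1; \\ \ldots \end{cases}$$

Furthermore, $(x, y)' \sim N(\mu, \Sigma)$; $\mu$ and $\Sigma$ are known. In what follows

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix},$$

and $\varepsilon_i^{(m)}$ $(i = 1, 2;\ m = 1, 2, \ldots)$ are mutually independent, standard normal random variables.

(a) Show that $E[f(x, y)]$ and $\mathrm{var}[f(x, y)]$ are finite.
(b) Consider the simulation

$$x^{(m)} = \mu_1 + \sigma_{11}^{1/2} \varepsilon_1^{(m)},$$
$$y^{(m)} = \mu_2 + (\sigma_{12}/\sigma_{11})(x^{(m)} - \mu_1) + (\sigma_{22} - \sigma_{12}^2/\sigma_{11})^{1/2} \varepsilon_2^{(m)}.$$

Show that

$$M^{-1} \sum_{m=1}^{M} f(x^{(m)}, y^{(m)}) \xrightarrow{a.s.} E[f(x, y)].$$

(For continuation, see Exercise 4.5.1.)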


4.2 ACCEPTANCE AND IMPORTANCE SAMPLING 

Suppose that we cannot derive a method for drawing i.i.d. random vectors $\theta^{(m)}$
directly from the density $p(\theta \mid I)$ but we can simulate i.i.d. drawings from a density
$p(\theta \mid S)$ that is similar to $p(\theta \mid I)$. Then it may be possible to learn about many
aspects of the vector of interest $\omega$, but the sense in which $p(\theta \mid S)$ is similar to
$p(\theta \mid I)$ is critical. We consider two approaches here: acceptance sampling and
importance sampling. Section 4.3.2 discusses a closely related third approach, the
independence Metropolis chain.


4.2.1 Acceptance Sampling


Acceptance sampling may be used to learn about any distribution of $\theta$ with generic
density $p(\theta \mid I)$, given a source density $p(\theta \mid S)$ from which i.i.d. random variables
can be drawn. For the methods discussed in this section, it is necessary that we
be able to evaluate an arbitrary kernel $k(\theta \mid S)$ of $p(\theta \mid S)$, and an arbitrary kernel
$k(\theta \mid I)$ of $p(\theta \mid I)$.

Figure 4.1 provides the intuition of acceptance sampling. The heavy solid curve
represents the density of interest $p(\theta \mid I)$ and the lighter solid curve, the source
density $p(\theta \mid S)$. The ratio $p(\theta \mid I)/p(\theta \mid S)$ is bounded above by a constant $a$. In
Figure 4.1, $p(1.16 \mid I)/p(1.16 \mid S) = a = 1.86$, and the dotted curve is $a \cdot p(\theta \mid S)$.
The idea is to draw $\theta^*$ from $p(\theta \mid S)$, and accept the draw with probability
$p(\theta^* \mid I)/[a \cdot p(\theta^* \mid S)]$. For example, if $\theta^* = 0$, then the draw is accepted with probability
.269, whereas if $\theta^* = 1.16$, then the draw is accepted with probability 1. The accepted
values in fact simulate i.i.d. drawings from the density of interest $p(\theta \mid I)$.


Theorem 4.2.1 Acceptance Sampling Let $k(\theta \mid I) = c_I \cdot p(\theta \mid I)$ be a kernel
of the density of interest $p(\theta \mid I)$, and let $k(\theta \mid S) = c_S \cdot p(\theta \mid S)$ be a kernel of
the source density $p(\theta \mid S)$. Let $r = \sup_{\theta \in \Theta} k(\theta \mid I)/k(\theta \mid S) < \infty$. Suppose that
$\theta^{(m)}$ is drawn as follows:

1. Draw $u$ uniform on $[0, 1]$.
2. Draw $\theta^* \sim p(\theta \mid S)$.
3. If $u > k(\theta^* \mid I)/[r \cdot k(\theta^* \mid S)]$, then return to step 1.
4. Set $\theta^{(m)} = \theta^*$.

If the draws in steps 1 and 2 are independent, then $\theta^{(m)} \sim p(\theta \mid I)$.

Figure 4.1. Acceptance sampling. (Curves: $p(\theta \mid I)$, $p(\theta \mid S)$, and $a \cdot p(\theta \mid S)$.)


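Proof: Let $\Theta^*$ denote the support of $p(\theta \mid S)$; $r < \infty$ implies $\Theta \subseteq \Theta^*$. The
unconditional probability of proceeding from step 3 to step 4 is

$$\int_{\Theta^*} \{k(\theta \mid I)/[r\, k(\theta \mid S)]\}\, p(\theta \mid S)\, d\nu(\theta) = c_I/(r c_S). \qquad (4.3)$$

Let $A$ be any subset of $\Theta$. The unconditional probability of proceeding from step
3 to step 4 with $\theta \in A$ is

$$\int_{A} \{k(\theta \mid I)/[r\, k(\theta \mid S)]\}\, p(\theta \mid S)\, d\nu(\theta) = \int_{A} k(\theta \mid I)\, d\nu(\theta)/(r c_S). \qquad (4.4)$$

The probability that $\theta \in A$, conditional on proceeding from step 3 to step 4, is the
ratio of (4.4) to (4.3), which is $\int_A k(\theta \mid I)\, d\nu(\theta)/c_I = \int_A p(\theta \mid I)\, d\nu(\theta) = P(\theta \in A \mid I)$.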


The proof of Theorem 4.2.1 provides the key to the efficiency of this algorithm.
Regardless of the choices of kernels, the unconditional probability in (4.3)
is $c_I/(r c_S) = \inf_{\theta \in \Theta} p(\theta \mid S)/p(\theta \mid I) = a^{-1}$. If we wish to generate $M$ draws of $\theta$
using acceptance sampling, the expected number of times we will have to draw $u$,
draw $\theta^*$, and compute $k(\theta^* \mid I)/[r \cdot k(\theta^* \mid S)]$ is $M \cdot \sup_{\theta \in \Theta} p(\theta \mid I)/p(\theta \mid S) = M \cdot a$.
The computational efficiency of the algorithm is driven by those $\theta$ for which
$p(\theta \mid S)$ has the most severe undersampling relative to $p(\theta \mid I)$. In most applications
the time-consuming part of the algorithm is the evaluation of the kernels
$k(\theta \mid S)$ and $k(\theta \mid I)$, especially the latter. [If $p(\theta \mid I)$ is a posterior density, then
evaluation of $k(\theta \mid I)$ entails computing the likelihood function.] In such cases
$a = \sup_{\theta \in \Theta} p(\theta \mid I)/p(\theta \mid S)$ is indeed the relevant measure of inefficiency.

The retained values of $\theta$ that constitute $\{\theta^{(m)}\}$ are independent, and all are drawn
from the distribution with density $p(\theta \mid I)$. If we also draw $\omega^{(m)} \sim p(\omega \mid \theta^{(m)}, I)$,
then Theorem 4.1.1 applies directly to the sequence $\{\theta^{(m)}, \omega^{(m)}\}$.
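
The four steps of Theorem 4.2.1 translate almost line for line into code. The sketch below is illustrative only: the function name `acceptance_sample`, the choice of a standard normal density of interest with kernel $\exp(-\theta^2/2)$, and the standard Cauchy source with kernel $1/(1 + \theta^2)$ are all assumptions made for the example. For this pair the kernel ratio is maximized at $\theta = \pm 1$, so $r = 2e^{-1/2}$, and the acceptance probability $c_I/(r c_S)$ is $\sqrt{2\pi}/(r\pi) \approx 0.658$.

```python
import math
import random

def acceptance_sample(M, k_interest, k_source, draw_source, r, rng):
    """Return M accepted draws and the total number of proposals."""
    accepted, proposals = [], 0
    while len(accepted) < M:
        u = rng.random()                     # step 1: u uniform on [0, 1]
        theta = draw_source(rng)             # step 2: theta* from the source
        proposals += 1
        if u > k_interest(theta) / (r * k_source(theta)):
            continue                         # step 3: reject, return to step 1
        accepted.append(theta)               # step 4: keep theta*
    return accepted, proposals

# Assumed example: N(0, 1) kernel of interest, standard Cauchy source
k_I = lambda t: math.exp(-0.5 * t * t)
k_S = lambda t: 1.0 / (1.0 + t * t)
draw_cauchy = lambda rng: math.tan(math.pi * (rng.random() - 0.5))  # inverse cdf
r = 2.0 * math.exp(-0.5)                     # sup of k_I / k_S, attained at |t| = 1

rng = random.Random(3)
thetas, proposals = acceptance_sample(20_000, k_I, k_S, draw_cauchy, r, rng)
print(len(thetas) / proposals)               # observed acceptance rate
```

The observed acceptance rate estimates $a^{-1}$, so its reciprocal is the expected number of kernel evaluations per retained draw.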


Example 4.2.1 Restricted Normal Linear Regression Model Suppose that the
coefficient vector $\beta$ in the normal linear regression model is restricted to a set
$S \subseteq \mathbb{R}^k$. For example, the signs of coefficients might be restricted, or functional
form restrictions might limit the range of $\beta$. To consider a simple case, suppose
that precision $h$ is fixed and

$$p(\beta \mid A) \propto \exp[-(\beta - \underline{\beta})' \underline{H} (\beta - \underline{\beta})/2]\, I_S(\beta).$$

Then

$$p(\beta \mid y^o, X, A) \propto \exp[-(\beta - \bar{\beta})' \bar{H} (\beta - \bar{\beta})/2]\, I_S(\beta)$$

with $\bar{H}$ as defined in (2.19) and $\bar{\beta}$ in (2.20). If the source distribution for acceptance
sampling is $N(\bar{\beta}, \bar{H}^{-1})$, then a draw is accepted if $\beta \in S$ and rejected if $\beta \notin S$. The
fraction accepted is the posterior probability that $\beta \in S$ when the prior distribution
for $\beta$ is $N(\underline{\beta}, \underline{H}^{-1})$, and there are no restrictions. For more details and applications,
see Geweke (1986), which develops this method in the context of the improper prior
distribution of Exercise 3.2.1.

The lower the acceptance probability, the lower is the computational efficiency
of the method. In fact, the acceptance probability can be so low that the algorithm
effectively halts computations. This is most likely to be a problem when $k$ is large
and acceptance sampling is incorporated in the algorithms described in Section 4.3.
In general it is never wise simply to assume that inequality constraints can be
handled effectively by acceptance sampling. In specific cases superior alternatives
are available. One such alternative is developed in Section 5.3, building on the
algorithm in the next example.


Example 4.2.2 Acceptance Sampling for the Truncated Univariate Normal
Distribution Tailoring the source density to the density of interest can be critical
to the effectiveness of acceptance sampling. A problem that arises frequently
is that of sampling from a standard normal distribution truncated to the interval
$(a, b)$. If the source density is normal, then the acceptance probability is $\Phi(b) - \Phi(a)$,
which can be quite small. The inverse cdf method [see Exercise 4.1.1(a)]
involves finding the root of a nonlinear equation that itself requires evaluation
of an integral. Moreover, for sufficiently large yet finite values of $a$, any numerical
integration routine returns $\Phi(a) = 1$, thus making it impossible to draw
from the normal distribution truncated to $(a, \infty)$. An efficient algorithm described
in Geweke (1991) applies acceptance sampling with the source distribution as
follows:

Characteristics of the truncation                                Source distribution
$a < 0 < b < \infty$     $a > -t_1$ and $b < t_1$                    Uniform$(a, b)$
                         $a < -t_1$ or $b > t_1$                     $N(0, 1)$
$0 < a < b < \infty$     $f(a)/f(b) \le t_2$                         Uniform$(a, b)$
                         $f(a)/f(b) > t_2$ and $a < t_3$             $|N(0, 1)|$
                         $f(a)/f(b) > t_2$ and $a > t_3$             $a + \exp(a^{-1})$
$b = \infty$             $a < t_4$                                   $N(0, 1)$
                         $a > t_4$                                   $a + \exp(a^{-1})$

where $f(x) = \exp(-x^2/2)$, and the constants $t_j$ are part of the design of the algorithm.
The cases $-\infty < a < b < 0$ and $a = -\infty$ require only changes in sign.
Geweke (1991) suggests $t_1 = .375$, $t_2 = 2.18$, $t_3 = .725$, and $t_4 = .45$ and reports
that the method requires from one-sixth to one-half the time of the inverse cdf
method, depending on the configuration of $a$ and $b$.


4.2.2 Importance Sampling


Rather than accept only a fraction of the draws from the source density, it is possible
to retain all of them, and consistently approximate $E[h(\omega) \mid I]$ by appropriately
weighting the draws. The probability density function of the source distribution
is then called the importance sampling density, a term due to Hammersley and
Handscomb (1964), who were among the first to propose the method. It appears
to have been introduced to the econometrics literature by Kloek and van Dijk
(1978). As with acceptance sampling, denote the source density by $p(\theta \mid S)$ and an
arbitrary kernel of the source density by $k(\theta \mid S) = c_S \cdot p(\theta \mid S)$ for any $c_S > 0$.
Denote an arbitrary kernel of the density of interest by $k(\theta \mid I) = c_I \cdot p(\theta \mid I)$ for
any $c_I > 0$. The following result is similar to Geweke (1989a), Theorem 2.


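Theorem 4.2.2 Approximation of Moments by Importance Sampling Suppose
that the sequence $\{\theta^{(m)}, \omega^{(m)}\}$ is independent and identically distributed, with
$\theta^{(m)} \sim p(\theta \mid S)$ and $\omega^{(m)} \sim p(\omega \mid \theta^{(m)}, I)$. Define the weighting function
$w(\theta) = k(\theta \mid I)/k(\theta \mid S)$, let $h : \Omega \to \mathbb{R}^1$, and consider several additional conditions:

1. $E[h(\omega) \mid I] = \bar{h}$ exists.
2. $\mathrm{var}[h(\omega) \mid I] = \sigma^2$ exists.
3. The support of $p(\theta \mid S)$ includes $\Theta$.
4. $w(\theta)$ is bounded above on $\Theta$.

Then

(a) Given conditions 1 and 3

$$\bar{h}^{(M)} = \frac{\sum_{m=1}^{M} w(\theta^{(m)})\, h(\omega^{(m)})}{\sum_{m=1}^{M} w(\theta^{(m)})} \xrightarrow{a.s.} \bar{h}. \qquad (4.5)$$

(b) Given conditions 1-4

$$M^{1/2}(\bar{h}^{(M)} - \bar{h}) \xrightarrow{d} N(0, \tau^2)$$

and

$$\hat{\tau}^{2(M)} = \frac{M \sum_{m=1}^{M} [h(\omega^{(m)}) - \bar{h}^{(M)}]^2\, w(\theta^{(m)})^2}{\left[\sum_{m=1}^{M} w(\theta^{(m)})\right]^2} \xrightarrow{a.s.} \tau^2. \qquad (4.6)$$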


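Proof: The sequence $\{w(\theta^{(m)})\}$ is i.i.d., and from conditions 1 and 3,

$$E[w(\theta) \mid S] = \int_{\Theta} \frac{k(\theta \mid I)}{k(\theta \mid S)}\, p(\theta \mid S)\, d\nu(\theta) = \frac{c_I}{c_S} \equiv \bar{w}.$$

By the strong law of large numbers, we obtain

$$\bar{w}^{(M)} = M^{-1} \sum_{m=1}^{M} w(\theta^{(m)}) \xrightarrow{a.s.} \bar{w}. \qquad (4.7)$$

The sequence $\{w(\theta^{(m)})\, h(\omega^{(m)})\}$ is also i.i.d., and

$$E[w(\theta) h(\omega) \mid S] = \int_{\Theta} w(\theta) \left[\int_{\Omega} h(\omega)\, p(\omega \mid \theta, I)\, d\nu(\omega)\right] p(\theta \mid S)\, d\nu(\theta)$$
$$= (c_I/c_S) \int_{\Theta} \int_{\Omega} h(\omega)\, p(\omega \mid \theta, I)\, p(\theta \mid I)\, d\nu(\omega)\, d\nu(\theta)$$
$$= (c_I/c_S)\, E[h(\omega) \mid I] = \bar{w} \cdot \bar{h}.$$

By the strong law of large numbers

$$\overline{wh}^{(M)} = M^{-1} \sum_{m=1}^{M} w(\theta^{(m)})\, h(\omega^{(m)}) \xrightarrow{a.s.} \bar{w} \cdot \bar{h}. \qquad (4.8)$$

Since the fraction in (4.5) is the ratio of the left side of (4.8) to the left side of
(4.7), (a) is established.

Turning to (b), first note that

$$E[w(\theta)^2 h(\omega)^2 \mid S] = \int_{\Theta} w(\theta)^2 \left[\int_{\Omega} h(\omega)^2\, p(\omega \mid \theta, I)\, d\nu(\omega)\right] p(\theta \mid S)\, d\nu(\theta)$$
$$= \int_{\Theta} w(\theta)\, \frac{k(\theta \mid I)}{k(\theta \mid S)} \left[\int_{\Omega} h(\omega)^2\, p(\omega \mid \theta, I)\, d\nu(\omega)\right] p(\theta \mid S)\, d\nu(\theta)$$
$$= (c_I/c_S) \int_{\Theta} w(\theta) \left[\int_{\Omega} h(\omega)^2\, p(\omega \mid \theta, I)\, d\nu(\omega)\right] p(\theta \mid I)\, d\nu(\theta). \qquad (4.9)$$

Condition 4 bounds (4.9) by

$$(c_I/c_S)\, E[h(\omega)^2 \mid I]\, \sup_{\theta \in \Theta} w(\theta) \qquad (4.10)$$

and from condition 2 (4.10) is finite. Taking the specific case $h(\omega) = 1$, it follows
that $E[w(\theta)^2 \mid S] < \infty$ as well. Hence

$$V = \mathrm{var}\left[\begin{pmatrix} w(\theta) h(\omega) \\ w(\theta) \end{pmatrix} \,\Big|\, S\right]$$

is finite, and excepting the trivial case in which $h(\omega)$ is almost surely constant, $V$
is nonsingular. By the Lindeberg-Lévy central limit theorem

$$M^{1/2} \left[\begin{pmatrix} \overline{wh}^{(M)} \\ \bar{w}^{(M)} \end{pmatrix} - \begin{pmatrix} \bar{w} \cdot \bar{h} \\ \bar{w} \end{pmatrix}\right] \xrightarrow{d} N(0, V).$$

Utilizing the asymptotic (in $M$) expansion known as the delta method (Casella and
Berger 2002, Section 5.5.4; Greene 2003, Section D.2.7), since

$$\bar{h}^{(M)} - \bar{h} = \frac{\overline{wh}^{(M)}}{\bar{w}^{(M)}} - \frac{\bar{w} \cdot \bar{h}}{\bar{w}} \approx \bar{w}^{-1} \left[\left(\overline{wh}^{(M)} - \bar{w} \cdot \bar{h}\right) - \bar{h}\left(\bar{w}^{(M)} - \bar{w}\right)\right],$$

we have

$$M^{1/2}(\bar{h}^{(M)} - \bar{h}) \xrightarrow{d} N(0, \tau^2),$$

where

$$\tau^2 = \bar{w}^{-2} \{\mathrm{var}[w(\theta) h(\omega) \mid S] - 2\bar{h}\, \mathrm{cov}[w(\theta) h(\omega), w(\theta) \mid S] + \bar{h}^2\, \mathrm{var}[w(\theta) \mid S]\} = \bar{w}^{-2}\, \mathrm{var}\{[w(\theta) h(\omega) - \bar{h}\, w(\theta)] \mid S\}.$$

This is consistently approximated by

$$\frac{M^{-1} \sum_{m=1}^{M} [w(\theta^{(m)})\, h(\omega^{(m)}) - \bar{h}^{(M)} w(\theta^{(m)})]^2}{\left[M^{-1} \sum_{m=1}^{M} w(\theta^{(m)})\right]^2},$$

which is equivalent to (4.6).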


An apparent attraction of importance sampling, relative to acceptance sampling,
is that it is formally easier to apply in approximating moments. It is necessary to
establish condition 1 of Theorem 4.2.2 regardless of the method of evaluation, and
whether condition 3 holds should be immediately apparent. By contrast, acceptance
sampling requires that the upper bound of the weight function $w(\theta)$ be determined
before a simulation consistent approximation to a moment can be constructed. The
analog of this condition in Theorem 4.2.2 is condition 4, but note that this condition
is needed only for the evaluation of numerical accuracy. Furthermore, if $w(\theta)$ has a
known bound, then condition 4 holds, but for condition 4 to hold it is not necessary
to know the bound. As a practical matter, however, if condition 4 does not hold,
then convergence is typically quite slow, and the inability to evaluate numerical
accuracy renders numerical approximations unreliable. Thus a practical advantage
of importance sampling, as opposed to acceptance sampling, is that the former is
practical if we establish the existence of an upper bound for $w(\theta)$, whereas in the
latter we must evaluate this bound.


Example 4.2.3 Reweighting to a Different Prior Distribution Suppose that we
have available an i.i.d. sample from a posterior density $p(\theta_A \mid y^o, A_1)$, corresponding
to model $A_1$ with prior density $p(\theta_A \mid A_1)$. Now suppose that we wish to
investigate a second model $A_2$ with the same observables density but prior density
$p(\theta_A \mid A_2)$. If $p(\theta_A \mid A_2)/p(\theta_A \mid A_1)$ is bounded above on $\Theta_A$, then $p(\theta_A \mid y^o, A_1)$
is an importance sampling density for $p(\theta_A \mid y^o, A_2)$ with conditions 3 and 4 of
Theorem 4.2.2 satisfied. The weight function is $p(\theta_A \mid A_2)/p(\theta_A \mid A_1)$. This procedure
is sometimes called "reweighting of the posterior simulation sample." See
Section 8.4 for further discussion.


Example 4.2.4 A Hybrid Acceptance and Importance Sampling Algorithm
Given density of interest $p(\theta \mid I)$ and source density $p(\theta \mid S)$, suppose that it is
known that $p(\theta \mid I)/p(\theta \mid S)$ is bounded on $\Theta$ but the bound is unknown. Define
the importance sampling density $p(\theta \mid a, S)$ with kernel

$$k(\theta \mid a, S) = \begin{cases} p(\theta \mid I) & \text{if } p(\theta \mid I)/p(\theta \mid S) \le a \\ p(\theta \mid S) & \text{if } p(\theta \mid I)/p(\theta \mid S) > a \end{cases}$$

Applying Theorem 4.2.1 to the density of interest kernel $k(\theta \mid a, S)$ and source density
$p(\theta \mid S)$, we see that $\sup_{\theta \in \Theta} [k(\theta \mid a, S)/p(\theta \mid S)] \le \max(a, 1)$, and therefore
i.i.d. draws from $p(\theta \mid a, S)$ are possible. Importance sampling (Theorem 4.2.2)
applies to the density of interest $p(\theta \mid I)$ and importance sampling density
$p(\theta \mid a, S)$. Conditions 1-4 of Theorem 4.2.2 apply to $p(\theta \mid a, S)$ to the extent that
they apply to $p(\theta \mid S)$. This strategy can be useful if generating the vector $\omega$ is
expensive relative to drawing $\theta$ from $p(\theta \mid S)$, and deciding to accept, reject, or
weight the draws. We reject many draws $\theta$ (without drawing $\omega$), which would have
had small weights with the importance sampling density $p(\theta \mid S)$ (but nonetheless
would have required drawing $\omega$).


The ratio $\sigma^2/\tau^2$ of the variance of a moment estimate based on hypothetical
i.i.d. draws to the limiting variance of the estimate based on the importance sample
is known as the relative numerical efficiency (RNE) of the simulator. If we can
sample directly from the density $p(\theta \mid I)$ (equivalently, if $k(\theta \mid S) \propto k(\theta \mid I)$, or
the weight function $w(\theta)$ is constant) then the RNE is 1.0. It is generally lower
for importance sampling. The RNE is inversely proportional to the number of
iterations of the posterior simulator required to achieve a given NSE. The NSE of
the approximation is roughly proportional to $\sigma \cdot (M \cdot \mathrm{RNE})^{-1/2}$.
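
Equations (4.5) and (4.6) and the RNE can be computed in a few lines. The sketch below is an illustration under assumptions that are not in the text: the density of interest is $N(0, 1)$, the source is $N(0, 3^2)$, $\omega = \theta$, and $h$ is the identity. The weights are bounded, so condition 4 of Theorem 4.2.2 holds. The function name `importance_sampling` and the weighted estimate of $\sigma^2$ used in the RNE are choices made for this example.

```python
import math
import random

def importance_sampling(h_vals, weights):
    """Return (h_bar, NSE, RNE) from function values and importance weights."""
    M = len(h_vals)
    sw = sum(weights)
    # Equation (4.5): weighted average
    h_bar = sum(w * h for w, h in zip(weights, h_vals)) / sw
    # Equation (4.6): estimate of the limiting variance tau^2
    tau2 = M * sum(((h - h_bar) * w) ** 2 for w, h in zip(weights, h_vals)) / sw ** 2
    # Weighted estimate of sigma^2 = var[h(omega) | I] (an assumed estimator)
    sigma2 = sum(w * (h - h_bar) ** 2 for w, h in zip(weights, h_vals)) / sw
    return h_bar, math.sqrt(tau2 / M), sigma2 / tau2

rng = random.Random(4)
M = 50_000
theta = [rng.gauss(0.0, 3.0) for _ in range(M)]            # source N(0, 9)
# w = k(theta | I) / k(theta | S) with kernels exp(-t^2/2) and exp(-t^2/18)
w = [math.exp(-0.5 * t * t + t * t / 18.0) for t in theta]
h_bar, nse, rne = importance_sampling(theta, w)
print(h_bar, nse, rne)
```

For this assumed pair of densities a direct calculation gives $\tau^2 \approx 1.16$ and hence an RNE of roughly 0.87, so about 15% more draws are needed than with direct sampling to reach the same NSE.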


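Theorem 4.2.3 Approximation of Bayes Actions by Importance Sampling
Suppose that the sequence $\{\theta^{(m)}, \omega^{(m)}\}$ is independent and identically distributed,
with $\theta^{(m)} \sim p(\theta \mid S)$ and $\omega^{(m)} \mid \theta^{(m)} \sim p(\omega \mid \theta^{(m)}, I)$. Suppose that the support
of $p(\theta \mid S)$ includes $\Theta$, and the weighting function $w(\theta) = k(\theta \mid I)/k(\theta \mid S)$ is
bounded above. Let $L(a, \omega) \ge 0$ be a loss function defined on $A \times \Omega$, where $A$ is
an open subset of $\mathbb{R}^m$. Suppose that the risk function

$$R(a) = \int_{\Omega} \int_{\Theta} L(a, \omega)\, p(\theta \mid I)\, p(\omega \mid \theta, I)\, d\theta\, d\omega$$

has a strict global minimum at $\hat{a} \in A \subseteq \mathbb{R}^m$. Consider several additional conditions,
for a suitably defined open neighborhood of $\hat{a}$, $N(\hat{a})$:

1. $M^{-1} \sum_{m=1}^{M} L(a, \omega^{(m)})$ converges uniformly to $R(a)$ for all $a \in N(\hat{a})$, almost
   surely.
2. $\partial L(a, \omega)/\partial a$ exists and is a continuous function of $a$, for all $\omega \in \Omega$ and all
   $a \in N(\hat{a})$.
3. $\partial^2 L(a, \omega)/\partial a\, \partial a'$ exists and is a continuous function of $a$, for all $\omega \in \Omega$ and
   all $a \in N(\hat{a})$.
4. $B = \mathrm{var}[\partial L(a, \omega)/\partial a|_{a=\hat{a}}\, w(\theta) \mid S]$ exists and is finite.
5. $H = E[\partial^2 L(a, \omega)/\partial a\, \partial a'|_{a=\hat{a}} \mid S]$ exists and is finite and nonsingular.
6. For any $\varepsilon > 0$, there exists $M_\varepsilon$ such that

$$P\Big[\sup_{a \in N(\hat{a})} |\partial^3 L(a, \omega)/\partial a_i\, \partial a_j\, \partial a_k| < M_\varepsilon \,\Big|\, S\Big] \ge 1 - \varepsilon$$

   for all $i, j, k = 1, \ldots, m$.

Let $\hat{A}_M$ be the set of all roots of $M^{-1} \sum_{m=1}^{M} [\partial L(a, \omega^{(m)})/\partial a]\, w(\theta^{(m)}) = 0$. Then

(a) Given conditions 1 and 2, for any $\varepsilon > 0$, we obtain

$$\lim_{M \to \infty} P\Big[\inf_{a \in \hat{A}_M} (a - \hat{a})'(a - \hat{a}) > \varepsilon \,\Big|\, S\Big] = 0.$$

(b) Given conditions 1-6, if $\hat{a}_M$ is any element of $\hat{A}_M$ such that $\hat{a}_M \xrightarrow{a.s.} \hat{a}$, then
   (i) $M^{1/2}(\hat{a}_M - \hat{a}) \xrightarrow{d} N(0, H^{-1} B H^{-1})$.
   (ii) $M^{-1} \sum_{m=1}^{M} w(\theta^{(m)})^2\, \partial L(a, \omega^{(m)})/\partial a|_{a=\hat{a}_M} \cdot \partial L(a, \omega^{(m)})/\partial a'|_{a=\hat{a}_M} \xrightarrow{a.s.} B$.
   (iii) $M^{-1} \sum_{m=1}^{M} w(\theta^{(m)})\, \partial^2 L(a, \omega^{(m)})/\partial a\, \partial a'|_{a=\hat{a}_M} \xrightarrow{a.s.} H$.

Proof: Consider the auxiliary problem in which the pdf of $\theta$ and $\omega$ is
$p(\theta \mid S)\, p(\omega \mid \theta, I)$ and the loss function is $L(a, \omega)\, w(\theta)$. Because the support of $p(\theta \mid S)$
includes $\Theta$, the unique Bayes action in the auxiliary problem is also $\hat{a}$. Then apply
Theorem 4.1.2 directly to the auxiliary problem.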


Note that if $w(\theta)$ is bounded above on $\Theta$, which is condition 4 of
Theorem 4.2.2, then $\mathrm{var}[\partial L(a, \omega)/\partial a|_{a=\hat{a}}\, w(\theta) \mid S]$ will exist and be finite
if the same is true of $\mathrm{var}[\partial L(a, \omega)/\partial a|_{a=\hat{a}} \mid I]$. The same optimization algorithms
may be applied to compute $\hat{a}_M$ here as in the case of direct sampling,
except that they are used with $R^*_M(a) = M^{-1} \sum_{m=1}^{M} L(a, \omega^{(m)})\, w(\theta^{(m)})$ rather than
with $M^{-1} \sum_{m=1}^{M} L(a, \omega^{(m)})$. Note, however, that
$R^*_M(\hat{a}_M)/[M^{-1} \sum_{m=1}^{M} w(\theta^{(m)})] \xrightarrow{a.s.} R(\hat{a})$.
The results of Theorem 4.2.3 may be found in Shao (1989). Importance
sampling can be especially attractive when the distribution of $\omega$ depends on the
action $a$, a case we do not consider here; see Geyer (1996).


Exercise 4.2.1 Sampling from the Tail of the Normal Distribution Suppose
that the density of interest is the univariate standard normal, truncated to $(a, \infty)$,
where $a > 0$. The source density is a translated exponential density with pdf
$p(x \mid S) = \theta \exp[-\theta(x - a)]\, I_{(a, \infty)}(x)$. Show that the optimal choice of $\theta$ is
$\theta = [a + (4 + a^2)^{1/2}]/2$. Note that as $a \to \infty$, $\theta/a \to 1$. [Geweke (1991) reports that in the
context of the algorithm described in Example 4.2.2 the gain in efficiency from
using the optimal value of $\theta$, relative to the simpler choice $\theta = a$, is not worth the
time to compute the optimal value.]
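
The simpler choice $\theta = a$ mentioned at the end of this exercise leads to a particularly short sampler. The sketch below is an illustration, not the code of Geweke (1991): it assumes that the source is $a$ plus an exponential variate with rate $a$ (mean $a^{-1}$), and the function name `draw_normal_tail` is hypothetical. With kernels $\exp(-x^2/2)$ and $\exp[-a(x - a)]$ the kernel ratio is largest at $x = a$, so the acceptance probability in step 3 of Theorem 4.2.1 reduces to $\exp[-(x - a)^2/2]$.

```python
import math
import random

def draw_normal_tail(a, rng):
    """Draw from N(0, 1) truncated to (a, inf), a > 0, by acceptance sampling."""
    while True:
        x = a + rng.expovariate(a)           # source: a + Exponential(rate a)
        # Accept with probability k_I(x) / [r * k_S(x)] = exp(-(x - a)^2 / 2)
        if rng.random() <= math.exp(-0.5 * (x - a) ** 2):
            return x

rng = random.Random(5)
a = 3.0
tail_draws = [draw_normal_tail(a, rng) for _ in range(20_000)]
print(sum(tail_draws) / len(tail_draws))     # compare with phi(a) / [1 - Phi(a)]
```

The sample mean can be checked against the exact truncated mean $\phi(a)/[1 - \Phi(a)]$, which is about 3.28 for $a = 3$.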


Exercise 4.2.2 Tuning an Acceptance Algorithm The values of the constants $t_j$
in Example 4.2.2 are good but not necessarily optimal. The best values are affected
by the software and hardware used to implement the algorithm. Using software and 
hardware available to you, code the algorithm and see if detectable improvements 
are possible. 


Exercise 4.2.3 Efficiency of Importance Sampling Suppose that $x \sim N(0, 1)$
and consider the two alternative source distributions for importance sampling:


x ~ N (0, 4) (4.11) 
x ~ N(0, 2). (4.12) 


Suppose that you were to use importance sampling to approximate E(x) for the 
distribution of interest, x ~ N (0, 1). 


(a) For one of the importance sampling distributions, (4.11) or (4.12), a cen- 
tral limit theorem can be used to assess the accuracy of the numerical 
approximation. Indicate which one, and calculate the relevant variance of 
the approximation. 

(b) For the importance sampling distribution you identified in (a), find the rel- 
ative numerical efficiency of the approximation. 


4.3 MARKOV CHAIN MONTE CARLO 


This section discusses a generalization of direct sampling known as Markov chain
Monte Carlo (MCMC). The idea is to construct a Markov chain $\{\theta^{(m)}\}$ with a state
space containing $\Theta$ and unique invariant probability density $p(\theta \mid I)$. Following an initial
transient or burn-in phase, the distribution of $\theta^{(m)}$ is approximately that of the density
$p(\theta \mid I)$. The exact sense in which this approximation holds is important, and
is taken up in Section 4.5. We continue to assume that $\omega$ can be simulated directly
from $p(\omega \mid \theta, I)$, so that given $\{\theta^{(m)}\}$ the corresponding $\omega^{(m)} \sim p(\omega \mid \theta^{(m)}, I)$ can
be drawn. We return to the use of $\{\omega^{(m)}\}$ to approximate $E[h(\omega) \mid I]$ in Section 4.5
as well. This section provides an introduction and heuristic motivation of MCMC.

Markov chain methods have a history in mathematical physics dating back to 
the algorithm of Metropolis et al. (1953). This method, which is described in
Hammersley and Handscomb (1964), Section 9.3, and Ripley (1987), Section 4.7, was
generalized by Hastings (1970), who focused on statistical problems, and was 
further explored by Peskun (1973). A version particularly suited to image recon- 
struction and problems in spatial statistics was introduced by Geman and Geman 
(1984). This was subsequently shown to have great potential for Bayesian compu- 
tation by Gelfand and Smith (1990). Their work, combined with data augmentation 
methods (Tanner and Wong 1987), has proved very successful in the treatment of 
latent variables in econometrics. Since 1990 application of Markov chain Monte 
Carlo methods has grown rapidly; new refinements, extensions, and applications 
appear almost continuously. 

This section concentrates on a heuristic development of two widely used MCMC
algorithms: the Gibbs sampler and the Metropolis-Hastings algorithm. The general
theory of convergence is discussed in Section 4.5. Section 4.6 details some specific
variants and combinations of these methods used extensively in the balance of this
volume. Section 4.7 turns to the assessment of numerical accuracy. While our main
interest is in applying these methods to the posterior density $p(\theta_A \mid y^o, A)$, they can
in principle be used with any density, and so we continue with the generic case of
$p(\theta \mid I)$.


4.3.1 The Gibbs Sampler 


The Gibbs sampler begins with a partition, or blocking, of $\theta$,
$\theta' = (\theta_{(1)}', \ldots, \theta_{(B)}')$.
Corresponding to any subvector $\theta_{(b)}$, let
$\theta_{<(b)}' = (\theta_{(1)}', \ldots, \theta_{(b-1)}')$ $(b = 2, \ldots, B)$
and $\theta_{<(1)} = \{\emptyset\}$. Similarly
$\theta_{>(b)}' = (\theta_{(b+1)}', \ldots, \theta_{(B)}')$ $(b = 1, \ldots, B-1)$ and
$\theta_{>(B)} = \{\emptyset\}$. Let $\theta_{-(b)}' = (\theta_{<(b)}', \theta_{>(b)}')$.
In applications we generally try to choose
the blocking so that it is possible to draw directly from each of the conditional
densities $p(\theta_{(b)} \mid \theta_{-(b)}, I)$. In this section we shall assume that it is possible. This
assumption will be weakened subsequently in Section 4.6.

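Suppose that there existed a single drawing $\theta^{(0)}$,
$\theta^{(0)\prime} = (\theta_{(1)}^{(0)\prime}, \ldots, \theta_{(B)}^{(0)\prime})$, from
the distribution with pdf $p(\theta \mid I)$. Successively make the drawings

\[ \theta_{(b)}^{(1)} \sim p(\theta_{(b)} \mid \theta_{<(b)}^{(1)}, \theta_{>(b)}^{(0)}, I) \quad (b = 1, \ldots, B). \tag{4.13} \]

This defines a transition process from $\theta^{(0)}$ to $\theta^{(1)}$,
$\theta^{(1)\prime} = (\theta_{(1)}^{(1)\prime}, \ldots, \theta_{(B)}^{(1)\prime})$. The
Gibbs sampler is defined by the choice of blocking, and by the forms of the
conditional densities induced by $p(\theta \mid I)$ and the blocking. Since
$\theta^{(0)} \sim p(\theta \mid I)$,
$(\theta_{<(b)}^{(1)}, \theta_{(b)}^{(1)}, \theta_{>(b)}^{(0)}) \sim p(\theta \mid I)$ at the $b$th step in (4.13).
In particular, $\theta^{(1)} \sim p(\theta \mid I)$.
In general, block $b$ of iterate $m$ of the Gibbs sampler is drawn as
$\theta_{(b)}^{(m)} \sim p(\theta_{(b)} \mid \theta_{<(b)}^{(m)}, \theta_{>(b)}^{(m-1)}, I)$
for $b = 1, \ldots, B$ and $m = 1, 2, \ldots$. This produces a sequence
$\{\theta^{(m)}\}$, which is a realization of a Markov chain. The transition density for this
chain is

\[ p(\theta^{(m)} \mid \theta^{(m-1)}, G) = \prod_{b=1}^{B} p(\theta_{(b)}^{(m)} \mid \theta_{<(b)}^{(m)}, \theta_{>(b)}^{(m-1)}, I). \tag{4.14} \]

Any single iterate $\theta^{(m)}$ retains the property that it is drawn from the density
$p(\theta \mid I)$.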

In practice, the Gibbs sampler should use as many blocks as required in order 
to make the drawings in (4.13) efficiently. On the other hand, it should use no 
more blocks than necessary because additional blocks usually reduce efficiency; 
see Exercise 4.5.1 for a simple motivating example. For many problems in econo- 
metrics and statistics the blocking is natural and the conditional distributions are 
familiar. 


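Example 4.3.1 A Gibbs Sampler in the Normal Linear Regression Model In
Example 2.1.2 the independent prior distributions
$\beta \mid A \sim N(\underline{\beta}, \underline{H}^{-1})$ and
$\underline{s}^2 h \sim \chi^2(\underline{\nu})$ led to the conditional posterior distributions

\[ \beta \mid (h, y^o, X, A) \sim N(\overline{\beta}, \overline{H}^{-1}) \quad \text{and} \quad \overline{s}^2 h \mid (\beta, y^o, X, A) \sim \chi^2(\overline{\nu}), \]

with

\[ \overline{H} = \underline{H} + hX'X, \quad \overline{\beta} = \overline{H}^{-1}(\underline{H}\,\underline{\beta} + hX'y^o), \quad \overline{s}^2 = \underline{s}^2 + (y^o - X\beta)'(y^o - X\beta), \]

and $\overline{\nu} = \underline{\nu} + T$. Hence the blocking $\theta_{(1)} = \beta$, $\theta_{(2)} = h$ is natural and convenient.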
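
The two conditional distributions of Example 4.3.1 can be turned into a short simulator. The following is a minimal sketch, not the author's code: it assumes NumPy is available, and the function name `gibbs_normal_regression`, the argument names for the prior hyperparameters, the starting value `h = 1.0`, and the simulated data set are all illustrative choices introduced here.

```python
import numpy as np

def gibbs_normal_regression(y, X, beta_prior, H_prior, s2_prior, nu_prior,
                            n_iter=2000, seed=0):
    """Two-block Gibbs sampler: beta | h, then h | beta (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    T, k = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    h = 1.0  # arbitrary starting precision (an assumption of this sketch)
    draws_beta = np.empty((n_iter, k))
    draws_h = np.empty(n_iter)
    for m in range(n_iter):
        # Block 1: beta | h ~ N(beta_bar, H_bar^{-1})
        H_bar = H_prior + h * XtX
        cov = np.linalg.inv(H_bar)
        beta_bar = cov @ (H_prior @ beta_prior + h * Xty)
        beta = rng.multivariate_normal(beta_bar, cov)
        # Block 2: s2_bar * h | beta ~ chi-square(nu_bar), nu_bar = nu_prior + T
        resid = y - X @ beta
        s2_bar = s2_prior + resid @ resid
        h = rng.chisquare(nu_prior + T) / s2_bar
        draws_beta[m], draws_h[m] = beta, h
    return draws_beta, draws_h

# Simulated data (hypothetical): intercept 1, slope 2, disturbance precision 4.
data_rng = np.random.default_rng(1)
T = 200
X = np.column_stack([np.ones(T), data_rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + data_rng.normal(scale=0.5, size=T)

draws_beta, draws_h = gibbs_normal_regression(
    y, X, beta_prior=np.zeros(2), H_prior=0.01 * np.eye(2),
    s2_prior=1.0, nu_prior=1.0)
post_mean_beta = draws_beta[500:].mean(axis=0)  # discard a burn-in of 500
post_mean_h = draws_h[500:].mean()
```

The burn-in length of 500 is likewise an arbitrary choice; Section 4.7 is where the text takes up the assessment of numerical accuracy.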


Of course, if it were possible to make an initial draw from the density $p(\theta \mid I)$,
then independent draws directly from $p(\theta \mid I)$ would also be possible. The purpose
of that assumption here is to marshal an informal argument that the density $p(\theta \mid I)$
is an invariant density of the Markov chain $p(\theta^{(m)} \mid \theta^{(m-1)}, G)$: that is, if
$\theta^{(m)} \sim p(\theta \mid I)$, then $\theta^{(m+s)} \sim p(\theta \mid I)$ for all $s > 0$.
An important remaining task is to elucidate conditions for the distribution of $\theta^{(m)}$
to converge to that of the pdf $p(\theta \mid I)$ given any $\theta^{(0)} \in \Theta$.

A more subtle complication is that even if $\theta^{(0)}$ were drawn from $p(\theta \mid I)$,
the argument just given establishes only that any single $\theta^{(m)}$ is also drawn from
$p(\theta \mid I)$. It does not establish that a single sequence $\{\theta^{(m)}\}$ is representative of
$p(\theta \mid I)$. Consider the example shown in Figure 4.2a, in which
$\Theta = \Theta_1 \cup \Theta_2$, and
the Gibbs sampling algorithm has blocks $\theta_{(1)} = \theta_1$ and $\theta_{(2)} = \theta_2$.
If $\theta^{(0)} \in \Theta_1$, then $\theta^{(m)} \in \Theta_1$ for $m = 1, 2, \ldots$.
Any single $\theta^{(m)}$ is just as representative of $p(\theta \mid I)$ as
is the single drawing $\theta^{(0)}$, but $\{\theta^{(m)}\}$ would not be representative of the distribution
of interest. Indeed, it would be misleading. In the example shown in Figure 4.2b,
if $\theta^{(0)}$ is the indicated point at the lower left vertex of the triangular closed support
of $p(\theta \mid I)$, then $\theta^{(m)} = \theta^{(0)}$ for $m = 1, 2, \ldots$. Clearly neither situation arises in
Example 4.3.1, but evidently a careful development of conditions under which
$\{\theta^{(m)}\}$ converges in distribution to $p(\theta \mid I)$ is needed. We return to that development
in Section 4.5.


Figure 4.2. Two examples in which a Gibbs sampling Markov chain will be reducible: (a) disjoint 
support; (b) vertex of closed set support. 
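
The disjoint-support case of Figure 4.2a can be reproduced numerically. The sketch below is an illustration under assumed specifics that are not in the text: the target is taken to be uniform on the two unit squares $[0,1]^2$ and $[2,3]^2$, so each square has probability one half, and the function names are hypothetical.

```python
import random

def conditional_draw(other, rng):
    # Uniform target on [0,1]^2 U [2,3]^2: given the other coordinate, the
    # conditional distribution is uniform on the interval containing it.
    lo = 0.0 if other <= 1.0 else 2.0
    return rng.uniform(lo, lo + 1.0)

def gibbs_disjoint(theta0, n_iter, seed=0):
    rng = random.Random(seed)
    t1, t2 = theta0
    path = []
    for _ in range(n_iter):
        t1 = conditional_draw(t2, rng)  # block 1 given block 2
        t2 = conditional_draw(t1, rng)  # block 2 given block 1
        path.append((t1, t2))
    return path

path = gibbs_disjoint((0.5, 0.5), 10_000)
# Share of iterates in the first square; the target assigns it probability 0.5.
share_in_first_square = sum(t1 <= 1.0 for t1, _ in path) / len(path)
```

Started in the first square, the chain never leaves it, so the simulated share is one rather than one half: each conditional draw is valid, yet the sequence is not representative of the target.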


4.3.2 The Metropolis-Hastings Algorithm


The Metropolis-Hastings algorithm is defined by an arbitrary transition probability
density function $q(\theta^* \mid \theta, H)$ indexed by $\theta \in \Theta$ and with density argument $\theta^*$, and
by an arbitrary starting value $\theta^{(0)} \in \Theta$. The random vector $\theta^*$ generated from
$q(\theta^* \mid \theta^{(m-1)}, H)$ is a candidate value for $\theta^{(m)}$. The algorithm sets
$\theta^{(m)} = \theta^*$ with probability

\[ \alpha(\theta^* \mid \theta^{(m-1)}, H) = \min\left\{ \frac{p(\theta^* \mid I)/q(\theta^* \mid \theta^{(m-1)}, H)}{p(\theta^{(m-1)} \mid I)/q(\theta^{(m-1)} \mid \theta^*, H)},\ 1 \right\}. \tag{4.15} \]

Otherwise, the algorithm sets $\theta^{(m)} = \theta^{(m-1)}$. It is common to say that the candidate
$\theta^*$ is accepted in the first instance and rejected in the second.


Let

\[ u(\theta^* \mid \theta, H) = q(\theta^* \mid \theta, H)\,\alpha(\theta^* \mid \theta, H), \tag{4.16} \]

a density kernel for all accepted candidates. Denote the unconditional probability of
rejection of the candidate drawn when the current state of the Metropolis-Hastings
chain is $\theta$ by

\[ r(\theta \mid H) = 1 - \int_{\Theta} u(\theta^* \mid \theta, H)\, d\nu(\theta^*). \tag{4.17} \]

This is the probability of rejecting $\theta^*$, given $\theta$, but before $\theta^*$ is actually drawn.
The Metropolis-Hastings Markov chain is thus driven by an indexed transition
probability measure defined on $\nu$-measurable sets $A \subseteq \Theta$:

\[ P(A \mid \theta, H) = \int_{A} u(\theta^* \mid \theta, H)\, d\nu(\theta^*) + r(\theta \mid H) I_A(\theta). \]

To write the corresponding transition probability density, let $\delta_{\theta}(\theta^*)$ denote the
Dirac delta function, a linear operator with the property

\[ \int_{A} \delta_{\theta}(\theta^*) f(\theta^*)\, d\nu(\theta^*) = f(\theta) I_A(\theta). \tag{4.18} \]

Then

\[ p(\theta^{(m)} \mid \theta^{(m-1)}, H) = u(\theta^{(m)} \mid \theta^{(m-1)}, H) + r(\theta^{(m-1)} \mid H)\,\delta_{\theta^{(m-1)}}(\theta^{(m)}). \tag{4.19} \]


This defines a Markov chain indexed by $\theta^{(m-1)}$ that places probability on $\Theta$. The
intuition behind this procedure is evident on the right side of (4.15), and is in many
respects similar to that in acceptance and importance sampling. If the transition
density makes a move from $\theta^{(m-1)}$ to $\theta^*$ quite likely, relative to $p(\theta^* \mid I)$, and
a move back from $\theta^*$ to $\theta^{(m-1)}$ quite unlikely, relative to $p(\theta^{(m-1)} \mid I)$, then the
algorithm will place a low probability on actually making the transition and a high
probability on staying at $\theta^{(m-1)}$. In the same situation, a prospective move from
$\theta^*$ to $\theta^{(m-1)}$ will always be made because draws of $\theta^{(m-1)}$ are made infrequently
relative to the density of interest $p(\theta \mid I)$.

This is the most general form of the Metropolis-Hastings algorithm, which is
due to Hastings (1970). The Metropolis et al. (1953) form takes
$q(\theta^* \mid \theta, H) = q(\theta \mid \theta^*, H)$, which leads to the simplification

\[ \alpha(\theta^* \mid \theta^{(m-1)}, H) = \min[\,p(\theta^* \mid I)/p(\theta^{(m-1)} \mid I),\ 1\,]. \]

A leading instance of that algorithm is the random-walk Metropolis chain in which
$q(\theta^* \mid \theta, H) = q(\theta^* - \theta \mid H)$, the latter density being symmetric about 0; see
Example 4.3.2.

Another special case is the Metropolis independence chain (Tierney 1994), in
which $q(\theta^* \mid \theta, H) = q(\theta^* \mid H)$. This leads to

\[ \alpha(\theta^* \mid \theta^{(m-1)}, H) = \min[\,w(\theta^*)/w(\theta^{(m-1)}),\ 1\,], \]

where $w(\theta) = p(\theta \mid I)/q(\theta \mid H)$. The independence chain is closely related to
acceptance sampling and importance sampling. But rather than place a low
probability of acceptance or a low weight on a draw that is unlikely relative to the
distribution of interest, the independence chain assigns a low probability of
accepting that candidate $\theta^*$ as the next draw $\theta^{(m)}$.
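
The general acceptance rule (4.15) and its two special cases can be sketched in a few lines. This is an illustration rather than a reference implementation: the function `metropolis_hastings`, its arguments, the standard normal target, and the proposal scales (2.4 for the random walk, 2.0 for the independence chain) are assumptions made here. Densities enter only through logarithms of kernels, since normalizing constants cancel in (4.15).

```python
import math
import random

def metropolis_hastings(log_p, propose, log_q, theta0, n_iter, seed=0):
    """log_p(t): log kernel of the target; propose(t, rng): candidate draw;
    log_q(to, frm): log kernel of q(to | frm). All names are illustrative."""
    rng = random.Random(seed)
    theta, draws, accepted = theta0, [], 0
    for _ in range(n_iter):
        cand = propose(theta, rng)
        # Logarithm of the ratio inside the min of (4.15).
        log_ratio = ((log_p(cand) - log_q(cand, theta))
                     - (log_p(theta) - log_q(theta, cand)))
        if rng.random() < math.exp(min(log_ratio, 0.0)):
            theta, accepted = cand, accepted + 1   # candidate accepted
        draws.append(theta)                        # otherwise stay put
    return draws, accepted / n_iter

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

log_p = lambda t: -0.5 * t * t   # standard normal target (assumed example)

# Random-walk Metropolis: symmetric q, so log_q cancels and can be set to 0.
rw_draws, rw_rate = metropolis_hastings(
    log_p, propose=lambda t, rng: t + rng.gauss(0.0, 2.4),
    log_q=lambda to, frm: 0.0, theta0=0.0, n_iter=20000)

# Independence chain: candidates from N(0, 2^2) regardless of the current state.
ind_draws, ind_rate = metropolis_hastings(
    log_p, propose=lambda t, rng: rng.gauss(0.0, 2.0),
    log_q=lambda to, frm: -0.5 * (to / 2.0) ** 2,
    theta0=0.0, n_iter=20000, seed=1)

rw_mean, rw_var = mean_var(rw_draws)
ind_mean, ind_var = mean_var(ind_draws)
```

Both chains should reproduce the target mean of 0 and variance of 1 up to simulation error; the acceptance rates `rw_rate` and `ind_rate` show how often the chain moves.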

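There is a simple two-step argument that motivates the convergence of the
sequence $\{\theta^{(m)}\}$, generated by the Metropolis-Hastings algorithm, to the
distribution of interest. [This approach is due to Chib and Greenberg (1995).] First,
note that if a transition probability density function
$p(\theta^{(m)} \mid \theta^{(m-1)}, T)$ satisfies the reversibility condition

\[ p(\theta^{(m-1)} \mid I)\, p(\theta^{(m)} \mid \theta^{(m-1)}, T) = p(\theta^{(m)} \mid I)\, p(\theta^{(m-1)} \mid \theta^{(m)}, T) \]

with respect to $p(\theta \mid I)$, then

\[
\begin{aligned}
\int_{\Theta} p(\theta^{(m-1)} \mid I)\, p(\theta^{(m)} \mid \theta^{(m-1)}, T)\, d\nu(\theta^{(m-1)})
&= \int_{\Theta} p(\theta^{(m)} \mid I)\, p(\theta^{(m-1)} \mid \theta^{(m)}, T)\, d\nu(\theta^{(m-1)}) \\
&= p(\theta^{(m)} \mid I) \int_{\Theta} p(\theta^{(m-1)} \mid \theta^{(m)}, T)\, d\nu(\theta^{(m-1)}) = p(\theta^{(m)} \mid I).
\end{aligned}
\tag{4.20}
\]

Expression (4.20) indicates that if $\theta^{(m-1)} \sim p(\theta \mid I)$, then the same is true of $\theta^{(m)}$.
The density $p(\theta \mid I)$ is an invariant density of the Markov chain with transition
density $p(\theta^{(m)} \mid \theta^{(m-1)}, T)$. This concept is developed more formally in Section 4.5.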

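The second step in this argument is to consider the implications of the
requirement that the Metropolis-Hastings transition density
$p(\theta^{(m)} \mid \theta^{(m-1)}, H)$ be reversible with respect to $p(\theta \mid I)$:

\[ p(\theta^{(m-1)} \mid I)\, p(\theta^{(m)} \mid \theta^{(m-1)}, H) = p(\theta^{(m)} \mid I)\, p(\theta^{(m-1)} \mid \theta^{(m)}, H). \]

For $\theta^{(m-1)} = \theta^{(m)}$ the requirement holds trivially. For
$\theta^{(m-1)} \neq \theta^{(m)}$ it implies that

\[
p(\theta^{(m-1)} \mid I)\, q(\theta^* \mid \theta^{(m-1)}, H)\, \alpha(\theta^* \mid \theta^{(m-1)}, H)
= p(\theta^* \mid I)\, q(\theta^{(m-1)} \mid \theta^*, H)\, \alpha(\theta^{(m-1)} \mid \theta^*, H).
\tag{4.21}
\]

Suppose without loss of generality that

\[ p(\theta^{(m-1)} \mid I)\, q(\theta^* \mid \theta^{(m-1)}, H) > p(\theta^* \mid I)\, q(\theta^{(m-1)} \mid \theta^*, H). \]

If $\alpha(\theta^{(m-1)} \mid \theta^*, H) = 1$ and

\[ \alpha(\theta^* \mid \theta^{(m-1)}, H) = \frac{p(\theta^* \mid I)\, q(\theta^{(m-1)} \mid \theta^*, H)}{p(\theta^{(m-1)} \mid I)\, q(\theta^* \mid \theta^{(m-1)}, H)}, \]

then (4.21) is satisfied.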
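
The two-step argument can be checked numerically in a finite-state analogue, where $\nu$ is counting measure and integrals become sums. The three-state target `p` and the asymmetric candidate matrix `q` below are made-up numbers used only for illustration.

```python
# Finite-state analogue of the reversibility argument (illustrative numbers).
p = [0.2, 0.5, 0.3]            # target probabilities, playing the role of p(theta | I)
q = [[0.1, 0.6, 0.3],          # q[i][j]: probability of proposing state j from state i
     [0.4, 0.2, 0.4],
     [0.5, 0.3, 0.2]]
n = len(p)

def alpha(i, j):
    # Acceptance probability (4.15) for a move from state i to candidate j.
    return min((p[j] / q[i][j]) / (p[i] / q[j][i]), 1.0)

# Transition matrix: accepted moves off the diagonal, and on the diagonal the
# probability of staying put (proposing i itself, or rejecting a candidate).
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            P[i][j] = q[i][j] * alpha(i, j)
    P[i][i] = 1.0 - sum(P[i][j] for j in range(n) if j != i)

# Step 2 of the argument: reversibility, p_i P_ij = p_j P_ji.
max_reversibility_gap = max(abs(p[i] * P[i][j] - p[j] * P[j][i])
                            for i in range(n) for j in range(n))

# Step 1 of the argument: reversibility implies invariance, sum_i p_i P_ij = p_j.
pP = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
max_invariance_gap = max(abs(pP[j] - p[j]) for j in range(n))
```

Both gaps are zero up to floating-point rounding, whatever positive entries are chosen for `q`.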
Example 4.3.2 Restricted Normal Linear Regression Model: Another Approach
Consider once again the situation of Example 4.2.1, a normal linear regression
model with fixed precision $h$, and the prior
$\beta \sim N(\underline{\beta}, \underline{H}^{-1})$ truncated to a set
$S \subseteq \mathbb{R}^k$. In the acceptance sampling algorithm developed in that example, $\beta$ is
drawn from the source distribution $N(\overline{\beta}, \overline{H}^{-1})$, accepted if $\beta \in S$, and rejected if
$\beta \notin S$. The probability of acceptance is

\[ (2\pi)^{-k/2} |\overline{H}|^{1/2} \int_{S} \exp[-(\beta - \overline{\beta})'\overline{H}(\beta - \overline{\beta})/2]\, d\beta, \]

and the algorithm will be impractical if this is quite small.
As an alternative, consider a Metropolis-Hastings algorithm in which the
transition density $q(\beta^* \mid \beta^{(m)}, H)$ corresponds to the distribution $N(\beta^{(m)}, V)$:

\[ q(\beta^* \mid \beta^{(m)}, H) = (2\pi)^{-k/2} |V|^{-1/2} \exp[-(\beta^* - \beta^{(m)})' V^{-1} (\beta^* - \beta^{(m)})/2]. \]

This is an example of a random-walk Metropolis chain. It shares the property
$q(\beta^* \mid \beta^{(m)}, H) = q(\beta^{(m)} \mid \beta^*, H)$ of the algorithm developed in Metropolis et al.
(1953). The acceptance probability is

\[ \alpha(\beta^* \mid \beta^{(m)}, H) = \min\left\{ \frac{\exp[-(\beta^* - \overline{\beta})'\overline{H}(\beta^* - \overline{\beta})/2]}{\exp[-(\beta^{(m)} - \overline{\beta})'\overline{H}(\beta^{(m)} - \overline{\beta})/2]},\ 1 \right\} \]

if $\beta^* \in S$ and 0 if $\beta^* \notin S$. This algorithm can succeed where the acceptance
algorithm is impractical if $V$ is chosen carefully. If $V$ is too large, most draws will
not be in $S$ and will therefore be rejected. As a consequence, in order to generate
$M$ distinct draws, many times more candidates will need to be drawn. If $V$ is too
small, most draws will be accepted but the distance moved will be quite small and
a very large number of iterations will be required to cover $S$ adequately. For the
algorithm to succeed, $V$ must also be scaled appropriately in all dimensions.
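
The effect of the scale of $V$ can be seen in a small experiment. The sketch below makes several simplifying assumptions that are not in the example: $k = 2$, $\overline{\beta} = 0$, $\overline{H} = I_2$, $S = \{\beta : \beta_1 > 1.5,\ \beta_2 > 1.5\}$, and $V = \text{step\_sd}^2 I_2$. Under these assumptions acceptance sampling from $N(0, I_2)$ would accept well under 1% of its draws.

```python
import math
import random

def rw_metropolis_truncated(step_sd, n_iter=20000, lower=1.5, seed=0):
    """Random-walk Metropolis for N(0, I_2) truncated to
    S = {beta: beta_1 > lower, beta_2 > lower} (illustrative setup)."""
    rng = random.Random(seed)
    beta = [lower + 0.5, lower + 0.5]          # starting value inside S
    draws, accepted = [], 0
    for _ in range(n_iter):
        cand = [b + rng.gauss(0.0, step_sd) for b in beta]
        if all(c > lower for c in cand):       # candidates outside S are rejected
            log_ratio = -0.5 * (sum(c * c for c in cand) - sum(b * b for b in beta))
            if rng.random() < math.exp(min(log_ratio, 0.0)):
                beta, accepted = cand, accepted + 1
        draws.append(beta)
    return draws, accepted / n_iter

# Acceptance rates for a too-small, a moderate, and a too-large proposal scale.
rates = {sd: rw_metropolis_truncated(sd)[1] for sd in (0.05, 0.5, 5.0)}

# Posterior mean of beta_1 from the moderately scaled chain.
draws_mid, _ = rw_metropolis_truncated(0.5, n_iter=50000)
mean_beta1 = sum(d[0] for d in draws_mid) / len(draws_mid)
```

A high acceptance rate is not by itself a sign of success: the chain with the smallest scale accepts most candidates but moves only a short distance at each step.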


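Exercise 4.3.1 The Behrens-Fisher Problem Here are several Bayesian variants
of this problem. In each variant

\[ y_1 = (y_{11}, \ldots, y_{n_1 1})', \quad y_2 = (y_{12}, \ldots, y_{n_2 2})', \]
\[ \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \sim N\left[ \begin{pmatrix} \mu_1 \iota_{n_1} \\ \mu_2 \iota_{n_2} \end{pmatrix}, \begin{pmatrix} \sigma_1^2 I_{n_1} & 0 \\ 0 & \sigma_2^2 I_{n_2} \end{pmatrix} \right], \]

where $\iota_n$ denotes an $n \times 1$ vector $(1, \ldots, 1)'$.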


(a) Suppose that the prior distribution is

\[ \underline{s}_1^2/\sigma_1^2 \sim \chi^2(\underline{\nu}_1), \quad \underline{s}_2^2/\sigma_2^2 \sim \chi^2(\underline{\nu}_2), \]
\[ \mu_1 = \mu_2 = \mu \sim N(\underline{\mu}, \underline{h}^{-1}). \]

(The random variables $\sigma_1^2$, $\sigma_2^2$, and $\mu$ are mutually independent.) Construct
a Gibbs sampling algorithm whose invariant distribution is the posterior
distribution of $(\mu, \sigma_1^2, \sigma_2^2)$. Be completely explicit about all conditional
distributions.


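(b) Suppose that the prior distribution is

\[ \underline{s}_1^2/\sigma_1^2 \sim \chi^2(\underline{\nu}_1), \quad \underline{s}_2^2/\sigma_2^2 \sim \chi^2(\underline{\nu}_2), \]
\[ \mu = (\mu_1, \mu_2)' \sim N(\underline{\mu}, \underline{H}^{-1}). \]

(The random variables $\sigma_1^2$, $\sigma_2^2$, and $\mu$ are mutually independent.) Construct
a Gibbs sampling algorithm whose invariant distribution is the posterior
distribution of $(\mu, \sigma_1^2, \sigma_2^2)$. Be completely explicit about all conditional
distributions.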


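(c) Suppose that the prior distribution is

\[ \underline{s}_1^2/\sigma_1^2 \sim \chi^2(\underline{\nu}_1), \quad \underline{s}_2^2/\sigma_2^2 \sim \chi^2(\underline{\nu}_2), \]
\[ \tilde{\mu} = (\mu_1 + \mu_2)/2 \sim N(\underline{\mu}, \underline{h}^{-1}), \]
\[ \mu_1 - \mu_2 \sim N(0, \underline{h}^{-1}) \text{ truncated to } \mu_1 - \mu_2 > 0. \]

(The random variables $\sigma_1^2$, $\sigma_2^2$, $\tilde{\mu}$, and $\mu_1 - \mu_2$ are mutually independent.)
Construct a Gibbs sampling algorithm whose invariant distribution is the
posterior distribution of $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$. Be completely explicit about all
conditional distributions.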


Exercise 4.3.2 Gibbs Sampling in a Nonlinear Regression Model Consider the
second-order autoregressive model

\[ y_t - \mu = \beta_1 (y_{t-1} - \mu) + \beta_2 (y_{t-2} - \mu) + \varepsilon_t, \quad \varepsilon_t \overset{\text{iid}}{\sim} N(0, h^{-1}) \quad (t = 1, \ldots, T). \]

The values $y_0$ and $y_{-1}$ are fixed. (Equivalently, they are ancillary statistics for $\mu$,
$\beta = (\beta_1, \beta_2)'$, and $h$.) The prior distribution is

\[ \mu \sim N(\underline{\mu}, \underline{h}_{\mu}^{-1}), \quad \beta \sim N(\underline{\beta}, \underline{H}_{\beta}^{-1}), \quad \underline{s}^2 h \sim \chi^2(\underline{\nu}). \]

(In the prior distribution, $\mu$, $\beta$, and $h$ are mutually independent.)

(a) Write the likelihood function for $\mu$, $\beta$, and $h$ and the respective prior density
kernels for $\mu$, $\beta$, and $h$.

(b) Design a Gibbs sampling algorithm to draw from the posterior distribution
of $\mu$, $\beta$, and $h$.

(c) Now suppose that, in addition, we impose the constraint that the roots of
the polynomial

\[ 1 - \beta_1 z - \beta_2 z^2 = 0 \]

satisfy $|z| > 1$. (This is a technical condition guaranteeing that $\{y_t\}$ does not
"blow up" as $t \to \infty$; more precisely, it ensures that $\{y_t\}$ is asymptotically
stationary.) Thus $\beta \in S \subseteq \mathbb{R}^2$, and the prior density kernel for $\beta$ found in
(a) is now multiplied by $I_S(\beta)$. Modify the algorithm that you designed in
(b) to simulate from the posterior distribution in this model.


(For continuation, see Exercise 4.4.2.) 


Exercise 4.3.3 Shape Constraints in Regression Hildreth (1954) had
observables as follows:

Fertilizer    Number of       Output
per Acre      Observations    per Acre
20            $T_1$           $(y_{11}, \ldots, y_{T_1 1})$
40            $T_2$           $(y_{12}, \ldots, y_{T_2 2})$
60            $T_3$           $(y_{13}, \ldots, y_{T_3 3})$
80            $T_4$           $(y_{14}, \ldots, y_{T_4 4})$
100           $T_5$           $(y_{15}, \ldots, y_{T_5 5})$
120           $T_6$           $(y_{16}, \ldots, y_{T_6 6})$

His model is $y_{tj} \sim N(\mu_j, \sigma^2)$ with all $y_{tj}$ mutually independent.

(a) Given the prior distribution

\[ \mu = (\mu_1, \ldots, \mu_6)' \sim N(\underline{\mu}, \underline{H}^{-1}), \quad \underline{s}^2/\sigma^2 \sim \chi^2(\underline{\nu}), \]

where $\mu$ and $\sigma^2$ are independent, how would you construct a Gibbs
sampling algorithm whose invariant distribution is the posterior distribution of
$\mu$ and $\sigma^2$?

(b) Suppose that the prior distribution is the same as in (a), except that in
addition you believe $\mu_1 < \cdots < \mu_6$. The restrictions could be handled by
appending an acceptance step to the algorithm in (a), but this could be
quite inefficient. Show how to construct a Gibbs sampler with $B = 7$ blocks
whose invariant distribution is the posterior distribution of $\mu$ and $\sigma^2$.

(c) Suppose that the prior distribution is the same as in (b), except that in
addition you believe that the expected output per acre (given fertilizer per
acre) is a strictly concave function of fertilizer per acre. How would you
construct a 7-block Gibbs sampling algorithm whose invariant distribution
is the posterior distribution of $\mu$ and $\sigma^2$?


4.4 VARIANCE REDUCTION 


All the Monte Carlo methods for evaluating $E[h(\omega) \mid I]$ considered to this point
generate an artificial sample $\{\theta^{(m)}, \omega^{(m)}\}$ $(m = 1, 2, \ldots)$ and a sequence of weights
$w(\theta^{(m)})$ with the property that

\[ \overline{h}^{(M)} = \sum_{m=1}^{M} w(\theta^{(m)}) h(\omega^{(m)}) \Big/ \sum_{m=1}^{M} w(\theta^{(m)}) \xrightarrow{a.s.} E[h(\omega) \mid I]. \]

Often it is possible to find a function $h^*(\theta_A, \omega)$ with the properties

\[ E[h^*(\theta_A, \omega) \mid I] = E[h(\omega) \mid I], \tag{4.22} \]
\[ \text{var}[h^*(\theta_A, \omega) \mid I] < \text{var}[h(\omega) \mid I], \tag{4.23} \]
\[ \overline{h}^{*(M)} = \sum_{m=1}^{M} w(\theta^{(m)}) h^*(\theta_A^{(m)}, \omega^{(m)}) \Big/ \sum_{m=1}^{M} w(\theta^{(m)}) \xrightarrow{a.s.} E[h(\omega) \mid I]. \tag{4.24} \]

Typically (4.24) can be derived from (4.22). Then (4.23) suggests that it is likely
that the numerical standard error associated with $\overline{h}^{*(M)}$ will be less than that
associated with $\overline{h}^{(M)}$. The extent of the reduction in numerical standard error, or whether
any reduction at all will necessarily occur, can be difficult to establish as an
analytical proposition in typical applications. What is more important is that the relative
accuracies of the two approximations can be evaluated as a practical matter using
the central limit theorem appropriate to the method: Theorem 4.1.1 for direct
sampling, Theorem 4.2.2 for importance sampling, or Theorems 4.7.1 and 4.7.3 for
MCMC sampling. Even more important is the fact that there are systematic
methods of constructing $h^*$ that will supply the potentially superior alternative in the
first place. This section describes two such methods.


4.4.1 Concentrated Expectations 


The essentials of the principle of concentrated expectations can be appreciated in
the generic problem of evaluating the integral $\iint f(x, y) p(x, y)\, dx\, dy$, in which
$p(x, y)$ is a density function. Direct sampling would entail
$(x^{(m)}, y^{(m)}) \overset{\text{iid}}{\sim} p(x, y)$
and the approximation $M^{-1} \sum_{m=1}^{M} f(x^{(m)}, y^{(m)})$. Suppose that we can evaluate
$E[f(x, y) \mid x] = \int f(x, y) p(x, y)\, dy \big/ \int p(x, y)\, dy$ analytically. By the law of
iterated expectations [see Casella and Berger (2002), Theorem 4.4.3]

\[ E\{E[f(x, y) \mid x]\} = E[f(x, y)], \]

and by the Rao-Blackwell theorem (Casella and Berger 2002, Theorem 7.3.17)

\[ \text{var}\{E[f(x, y) \mid x]\} \le \text{var}[f(x, y)] \]

as long as the latter variance exists and is finite. Hence the numerical standard
error associated with $M^{-1} \sum_{m=1}^{M} E[f(x^{(m)}, y) \mid x^{(m)}]$ is no greater, and is generally
smaller, than that associated with $M^{-1} \sum_{m=1}^{M} f(x^{(m)}, y^{(m)})$. The gain comes from
performing part of the integration analytically rather than entirely numerically. The
following result places this idea in context.
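
A small simulation makes the variance comparison concrete. The setup is an assumed toy problem, not one from the text: $x \sim N(0, 1)$, $y \mid x \sim N(x, 1)$, and $f(x, y) = y^2$, so that $E[f(x, y) \mid x] = x^2 + 1$ is available analytically and $E[f(x, y)] = 2$.

```python
import random

rng = random.Random(0)
M = 20000
direct, concentrated = [], []
for _ in range(M):
    x = rng.gauss(0.0, 1.0)
    y = rng.gauss(x, 1.0)               # y | x ~ N(x, 1)
    direct.append(y * y)                # f(x, y) = y^2, fully simulated
    concentrated.append(x * x + 1.0)    # E[f(x, y) | x] = x^2 + 1, analytic in y

def mean_var(v):
    m = sum(v) / len(v)
    return m, sum((a - m) ** 2 for a in v) / (len(v) - 1)

mean_direct, var_direct = mean_var(direct)
mean_conc, var_conc = mean_var(concentrated)

# Numerical standard errors of the two approximations of E[f(x, y)] = 2.
nse_direct = (var_direct / M) ** 0.5
nse_conc = (var_conc / M) ** 0.5
```

In this toy problem the population variances are 8 for $y^2$ and 2 for $x^2 + 1$, so integrating $y$ out analytically should roughly halve the numerical standard error.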


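Theorem 4.4.1 Concentrated Expectations in Posterior Simulation Suppose
that the sequence $\{\theta^{(m)}, \omega_1^{(m)}\}$ is independently and identically distributed with
$\theta^{(m)} \sim p(\theta \mid I)$ and $\omega_1^{(m)} \sim p(\omega \mid \theta^{(m)}, I)$. Let
$\theta' = (\theta_1', \theta_2')$, $\omega_2^{(m)} = E(\omega \mid \theta^{(m)}, I)$,
$\omega_3^{(m)} = E(\omega \mid \theta_1^{(m)}, I)$, $\overline{\omega} = E(\omega \mid I)$, and
$\overline{\omega}_j^{(M)} = M^{-1} \sum_{m=1}^{M} \omega_j^{(m)}$
$(j = 1, 2, 3)$. Then

\[ M^{1/2} (\overline{\omega}_j^{(M)} - \overline{\omega}) \xrightarrow{d} N(0, \tau_j^2) \quad (j = 1, 2, 3) \tag{4.25} \]

and

\[ \tau_3^2 \le \tau_2^2 \le \tau_1^2. \tag{4.26} \]

Proof: Because $\{\theta^{(m)}, \omega_1^{(m)}\}$ is i.i.d., so are $\{\omega_j^{(m)}\}$ $(j = 1, 2, 3)$. By the law of
iterated expectations $E(\omega_j^{(m)} \mid I) = E(\omega \mid I)$ $(j = 1, 2, 3;\ m = 1, \ldots, M)$. By the
Rao-Blackwell theorem

\[ \text{var}(\omega_3^{(m)} \mid I) \le \text{var}(\omega_2^{(m)} \mid I) \le \text{var}(\omega_1^{(m)} \mid I). \]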


The conditions in Theorem 4.4.1 are those of direct sampling, and hence the
immediate applicability of this result is rather limited. The treatment of
convergence of MCMC algorithms in Sections 4.5 and 4.7 applies directly to (4.25), and
consequently the methods there can be used to approximate $\tau_j^2$ and thereby
evaluate the accuracy of the alternative approximations $\overline{\omega}_j^{(M)}$ $(j = 1, 2, 3)$. Similarly, the
method of concentrated expectations can be used in combination with importance
sampling, and the methods of Section 4.2.2 can be used to assess accuracy. All
of these methods provide an estimate $\hat{\tau}_j^2$, and this is generally lower when the
method of concentrated expectations is applied than when it is not. Whether or
not (4.26) applies for these algorithms in general is not known. Liu et al. (1994)
and McKeague and Wefelmeyer (2000) have shown that it does for certain MCMC
algorithms, given some regularity conditions beyond those discussed in Section 4.5.


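Example 4.4.1 Concentrated Expectations and Prediction in the Normal Linear
Regression Model In the context of the normal linear model and Gibbs
sampling algorithm of Example 4.3.1, suppose
$\omega = y_{T+1} = \beta' x_{T+1} + \varepsilon_{T+1}$, where
the covariate vector $x_{T+1}$ is known. Apply Theorem 4.4.1 using $\theta_1 = h$ to obtain
$\omega_1^{(m)} \sim N(\beta^{(m)\prime} x_{T+1}, h^{(m)-1})$,
$\omega_2^{(m)} = \beta^{(m)\prime} x_{T+1}$, and
$\omega_3^{(m)} = \overline{\beta}^{(m)\prime} x_{T+1}$, where

\[ \overline{\beta}^{(m)} = (\underline{H} + h^{(m)} X'X)^{-1} (\underline{H}\,\underline{\beta} + h^{(m)} X'y^o). \]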
The numerical standard error of $\overline{\omega}_3^{(M)}$ will be much smaller than that of
$\overline{\omega}_1^{(M)}$ in typical applications; see Exercise 4.4.2 for an example. Moreover,
$\overline{\omega}_3^{(M)}$ requires less computing than does either $\overline{\omega}_1^{(M)}$ or $\overline{\omega}_2^{(M)}$.


This example illustrates the fact that virtually every Gibbs sampling algorithm
provides potential applications of the principle of concentrated expectations.
Since $p(\theta_{(b)} \mid \theta_{-(b)}, I)$ is the density of a tractable distribution, in general
$E(\theta_{(b)} \mid \theta_{-(b)}, I)$ will be known analytically. Using
$E(\theta_{(b)} \mid \theta_{-(b)}^{(m)}, I)$ rather than
$\theta_{(b)}^{(m)}$ to approximate $E(\theta_{(b)} \mid I)$ will typically pay a dividend in the form of reduced
numerical standard error. As Example 4.4.1 illustrates, this property of Gibbs
sampling algorithms can be used to improve the simulation approximation of other
posterior moments, as well.


4.4.2 Antithetic Sampling 


The principle of antithetic variates in Monte Carlo integration dates at least to
Hammersley and Morton (1956). In the original formulation, an i.i.d. sequence of random
vectors $(\omega^{(1,m)\prime}, \omega^{(2,m)\prime})'$ is generated from a sampling scheme $R$. The marginal
distribution of $\{\omega^{(j,m)}\} \mid R$ $(j = 1, 2)$ is the same as that of an i.i.d. sequence drawn
from the distribution $I$, but $\text{cov}(\omega^{(1,m)}, \omega^{(2,m)} \mid R) < 0$. Then

\[ \text{var}\left[ \sum_{m=1}^{M} (\omega^{(1,m)} + \omega^{(2,m)})/2M \,\Big|\, R \right] < \frac{1}{2}\, \text{var}\left( \sum_{m=1}^{M} \omega^{(m)}/M \,\Big|\, I \right). \]

This is the relevant comparison if the computation time for generating $\{\omega^{(1,m)}\}$ and
$\{\omega^{(2,m)}\}$ is double that for $\{\omega^{(m)}\}$ alone. In the limiting case
$\text{cov}(\omega^{(1,m)}, \omega^{(2,m)} \mid R) = -\text{var}(\omega^{(m)} \mid I)$ the approximation becomes exact,
but then also $E(\omega \mid I)$ is likely to be known. To approximate $E[h(\omega) \mid I]$, we use

\[ \sum_{m=1}^{M} [h(\omega^{(1,m)}) + h(\omega^{(2,m)})]/2M \tag{4.27} \]

in place of $\sum_{m=1}^{M} h(\omega^{(m)})/M$. It need not be the case that the alternative provides
a more efficient approximation. Loosely speaking, if $h$ is roughly linear over most
of the support of $\omega$ and $\text{cov}(\omega^{(1,m)}, \omega^{(2,m)} \mid R) < 0$, there will be an improvement.
For example, if $\omega^{(1,m)} \sim N(\mu, \sigma^2)$, with $\mu$ and $\sigma^2$ known, $E[h(\omega^{(1,m)}) \mid I]$ cannot
be derived analytically, and $h$ is roughly linear, then taking
$\omega^{(2,m)} = \mu - (\omega^{(1,m)} - \mu) = 2\mu - \omega^{(1,m)}$ may provide a substantially more efficient
approximation. The idea extends immediately to random vectors $\omega$.
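
The normal example can be simulated directly. In this sketch the values $\mu = 2$, $\sigma = 0.5$ and the function $h(\omega) = \log(1 + e^{\omega})$ are assumptions chosen for illustration; $h$ is monotone and close to linear over most of the support, which is the favorable case described above.

```python
import math
import random

def h(w):
    # Assumed function of interest: smooth, monotone, and roughly linear near mu.
    return math.log1p(math.exp(w))

rng = random.Random(0)
mu, sigma, M = 2.0, 0.5, 10000
single, paired = [], []
for _ in range(M):
    w1 = rng.gauss(mu, sigma)
    w2 = 2.0 * mu - w1                      # antithetic draw, same marginal as w1
    single.append(h(w1))
    paired.append(0.5 * (h(w1) + h(w2)))    # one term of (4.27)

def mean_var(v):
    m = sum(v) / len(v)
    return m, sum((a - m) ** 2 for a in v) / (len(v) - 1)

mean_single, var_single = mean_var(single)
mean_paired, var_paired = mean_var(paired)

# Equal-cost comparison: one antithetic pair costs about two direct draws, so
# antithetic sampling is the more efficient choice if var_paired < var_single / 2.
efficiency_gain = (var_single / 2.0) / var_paired
```

If `h` were replaced by a function symmetric about `mu`, the two terms of each pair would coincide and the pairing would bring no gain at double the cost.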

This principle applies in the more sophisticated sampling methods taken up in
this chapter, as well. The only challenge is in finding a sequence $\omega^{(2,m)}$
corresponding to $\omega^{(1,m)}$ that may yield improved approximations. Once this is done, the
application of the idea amounts to using (4.27) in direct sampling and MCMC
algorithms; some straightforward changes in weighting, developed in Exercise 4.4.3,
may be required for importance sampling. In the case of Gibbs sampling
algorithms, the individual blocks often provide the basis for the alternative sequence
$\omega^{(2,m)}$, as illustrated in the following example.


Example 4.4.2 Antithetic Gibbs Sampling in a Nonlinear Regression Model
In the context of the nonlinear regression model and Gibbs sampling algorithm of
Exercise 4.3.2, parts (a) and (b), suppose that the function of interest is the
amplitude of the smallest root of $1 - \beta_1 z - \beta_2 z^2$; denote this root by $\omega = f(\beta)$, where
$\beta = (\beta_1, \beta_2)'$. In the posterior simulator of Exercise 4.3.2(b), $\beta^{(1,m)}$ is drawn from
a normal distribution with mean $\overline{\beta}^{(m)}$. Let $\omega^{(1,m)} = f(\beta^{(1,m)})$,
$\beta^{(2,m)} = 2\overline{\beta}^{(m)} - \beta^{(1,m)}$, and $\omega^{(2,m)} = f(\beta^{(2,m)})$.
In the approximation of $E(\omega \mid I)$, use
$(\omega^{(1,m)} + \omega^{(2,m)})/2$ in place of $\omega^{(1,m)}$. The methods developed in Section 4.7 for the
evaluation of numerical accuracy apply directly, so the efficiencies of the two alternative
approximations of $E(\omega \mid y^o, A)$ will be apparent.


As sample size $T$ increases, the posterior distribution of $T^{1/2}(\theta_A - \theta_A^*)$ typically
becomes symmetric about zero for some $\theta_A^* \in \mathbb{R}^k$. For example, this will happen
if Theorem 3.4.3 applies. If the function of interest is smooth, then in the limit
the approximation problem becomes that of evaluating the mean of a linear
function of symmetrically distributed random variables. Therefore gains to antithetic
sampling should increase with sample size, given suitable regularity conditions, a
phenomenon known as antithetic acceleration.

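These conditions were developed formally in Geweke (1988) for posterior
moments of the form $\overline{g}_T = E[g(\theta_A) \mid Y_T^o, A]$. For each $T$, by direct sampling

\[ \theta_{AT}^{(1,m)} \overset{\text{iid}}{\sim} p(\theta_A \mid Y_T^o, A) \quad (m = 1, 2, \ldots), \]

and the antithetic sample is

\[ \theta_{AT}^{(2,m)} = 2E[\theta_A \mid Y_T^o, A] - \theta_{AT}^{(1,m)}. \]

The direct approximation of $E[g(\theta_A) \mid Y_T^o, A]$ is
$\hat{g}_T^{(M)} = \sum_{m=1}^{M} g(\theta_{AT}^{(1,m)})/M$ and
the antithetic approximation is

\[ \hat{g}_T^{*(M)} = \sum_{m=1}^{M} [g(\theta_{AT}^{(1,m)}) + g(\theta_{AT}^{(2,m)})]/2M. \]

The regularity conditions in Geweke (1988) include continuous twice
differentiability of $g$ in a neighborhood of $\theta_A^*$, with $a = g'(\theta_A^*)$ and
$B = (1/2) g''(\theta_A^*)$, as well as

\[ \lim_{T \to \infty} T\, \text{var}(\theta_A \mid Y_T^o, A) = \Sigma \quad \text{and} \quad \lim_{T \to \infty} T^2\, \text{var}(\theta_A' B \theta_A \mid Y_T^o, A) = \delta, \]

conditions similar to those in Theorem 3.4.3. Then
$M^{1/2}(\hat{g}_T^{(M)} - \overline{g}_T) \xrightarrow{d} N(0, \tau_T^2)$,
$M^{1/2}(\hat{g}_T^{*(M)} - \overline{g}_T) \xrightarrow{d} N(0, \tau_T^{*2})$, and

\[ \lim_{T \to \infty} T \tau_T^{*2}/\tau_T^2 = \delta / a' \Sigma a. \]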


There are no parallel results for importance or MCMC sampling, but this is no 
impediment to the use of antithetic sampling. The additional demands are usually 
modest, as suggested by Example 4.4.2, and the accuracy of the approximation can 
be evaluated using the method described in Section 4.7. 


Exercise 4.4.1 Improving the Approximation of a Posterior Probability 
Reconsider the problem of approximating (4.1) in Example 4.1.1. 


(a) Express $P(\omega < a \mid x^*, \beta, h, A)$ in terms of $\Phi(\cdot)$, the cdf of the standard
normal distribution. (Most mathematical applications software can evaluate
$\Phi$ efficiently.)


(b) Use the result in (a) to find a numerical approximation to (4.1) with variance 
less than that of p“, the approximation suggested in Example 4.1.1. 


Exercise 4.4.2 Variance Reduction Methods and Forecasting Consider the
nonlinear regression model of Exercise 4.3.2 and the Gibbs sampling algorithm
developed there. This model can be used to forecast
$\omega = (y_{T+1}, \ldots, y_{T+F})'$. If the
loss function is quadratic in $\omega$, then the appropriate forecast is
$\hat{y} = E(\omega \mid Y_T^o, A)$.
This could be accomplished by drawing

\[ \omega_1^{(m)} \sim N[\mu^{(m)} + \beta_1^{(m)} (y_T - \mu^{(m)}) + \beta_2^{(m)} (y_{T-1} - \mu^{(m)}),\ h^{(m)-1}], \]
\[ \omega_2^{(m)} \sim N[\mu^{(m)} + \beta_1^{(m)} (\omega_1^{(m)} - \mu^{(m)}) + \beta_2^{(m)} (y_T - \mu^{(m)}),\ h^{(m)-1}], \]
\[ \omega_s^{(m)} \sim N[\mu^{(m)} + \beta_1^{(m)} (\omega_{s-1}^{(m)} - \mu^{(m)}) + \beta_2^{(m)} (\omega_{s-2}^{(m)} - \mu^{(m)}),\ h^{(m)-1}], \]

$s = 3, \ldots, F$. Then the simulation approximation of $\hat{y}$ is
$M^{-1} \sum_{m=1}^{M} \omega^{(m)}$. This
exercise utilizes variance reduction methods to improve on this procedure.

(a) Use the law of iterated expectations to show that conditional on $(Y_T^o, \beta, \mu, h)$

\[ E(y_{T+1} - \mu) = \beta_1 (y_T - \mu) + \beta_2 (y_{T-1} - \mu), \]
\[ E(y_{T+2} - \mu) = \beta_1 E(y_{T+1} - \mu) + \beta_2 (y_T - \mu), \]
\[ E(y_{T+s} - \mu) = \beta_1 E(y_{T+s-1} - \mu) + \beta_2 E(y_{T+s-2} - \mu) \]

for $s = 3, \ldots, F$. Use this result, the principle of concentrated
expectations, and the original simulation sample $\{\mu^{(m)}, \beta^{(m)}, h^{(m)}\}$ to improve the
approximation of $\hat{y}$.

(b) Apply the principle of concentrated expectations to $E[\beta \mid Y_T^o, \mu, h, A]$ to
further improve the approximation of $\hat{y}_{T+1}$. Does this idea extend to $\hat{y}_{T+2}$?

(c) Show how antithetic samples of $\beta \mid (Y_T^o, \mu, h, A)$ could be used to improve
the approximation of $\hat{y}_{T+s}$ $(s = 2, \ldots, F)$. Would you expect the
improvement to be greater for $\hat{y}_{T+2}$ or for $\hat{y}_{T+F}$?


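Exercise 4.4.3 Antithetic Importance Sampling In many importance sampling
algorithms the source density $p(\theta \mid S)$ is symmetric about $\mu = E(\theta \mid S)$, and
consequently the construction of the antithetic sequence
$\theta^{(2,m)} = 2\mu - \theta^{(1,m)}$ is trivial.
Corresponding to $\theta^{(1,m)}$, $\omega^{(1,m)} \sim p(\omega \mid \theta^{(1,m)}, I)$, and corresponding to
$\theta^{(2,m)}$, $\omega^{(2,m)} \sim p(\omega \mid \theta^{(2,m)}, I)$ $(m = 1, 2, \ldots)$. Define
$\overline{h}$, $w(\theta)$, $\overline{h}^{(M)}$, and $\tau^2$ as in Theorem 4.2.2.

(a) Show that

\[ \overline{h}^{*(M)} = \frac{\sum_{m=1}^{M} [w(\theta^{(1,m)}) h(\omega^{(1,m)}) + w(\theta^{(2,m)}) h(\omega^{(2,m)})]}{\sum_{m=1}^{M} [w(\theta^{(1,m)}) + w(\theta^{(2,m)})]} \xrightarrow{a.s.} \overline{h}. \]

(b) State conditions under which
$M^{1/2} (\overline{h}^{*(M)} - \overline{h}) \xrightarrow{d} N(0, \tau^{*2})$.

(c) Show that $\tau^{*2}/\tau^2 < \frac{1}{2}$ if and only if

\[ \text{cov}\{[h(\omega^{(1,m)}) - \overline{h}]\, w(\theta^{(1,m)}),\ [h(\omega^{(2,m)}) - \overline{h}]\, w(\theta^{(2,m)}) \mid S\} < 0. \]

Under what conditions is this inequality likely to hold?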


4.5 SOME CONTINUOUS STATE SPACE MARKOV CHAIN THEORY 


The informal treatment of the Gibbs sampler and the Hastings-Metropolis
algorithm in Section 4.3 leaves unresolved important questions about the conditions
under which the simulation sample $\{\theta^{(1)}, \ldots, \theta^{(M)}\}$ will become representative of
$p(\theta \mid I)$ as $M \to \infty$. This section turns to that question. Much of the treatment
here draws heavily on the work of Tierney (1991, 1994), who first used the
theory of continuous state space Markov chains to demonstrate convergence, and
Roberts and Smith (1994), who elucidated sufficient conditions for convergence
that turn out to be applicable in a wide variety of problems in econometrics and
statistics.

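Let $C$ denote a generic Markov chain $\{\theta^{(m)}\}$ defined on $\Theta \times \Theta$ by a transition
kernel $u(\theta^* \mid \theta, C)$ with the property that

\[ r(\theta \mid C) = 1 - \int_{\Theta} u(\theta^* \mid \theta, C)\, d\nu(\theta^*) \in [0, 1) \quad \forall\, \theta \in \Theta. \]

For any $\nu$-measurable set $A \subseteq \Theta$,

\[ P(\theta^{(m)} \in A \mid \theta^{(m-1)}, C) = \int_{A} u(\theta \mid \theta^{(m-1)}, C)\, d\nu(\theta) + r(\theta^{(m-1)} \mid C)\, I_A(\theta^{(m-1)}). \]

In the case of the Gibbs sampler discussed in the previous section the transition
density function $p(\theta^* \mid \theta, G)$ is defined in (4.14), and $r(\theta \mid C) = 0$ $\forall\, \theta \in \Theta$. In the
case of the Hastings-Metropolis algorithm, $u(\theta^* \mid \theta, C) = u(\theta^* \mid \theta, H)$, defined
in (4.16), and $r(\theta \mid C) = r(\theta \mid H)$, defined in (4.17). The transition kernel $u$ is
substochastic; it is proportional to the probability density of the accepted
candidates only. The corresponding substochastic kernel over $m$ steps is then defined
iteratively:

\[
\begin{aligned}
u^{(m)}(\theta^{(m)} \mid \theta^{(0)}, C) ={}& \int_{\Theta} u^{(m-1)}(\theta \mid \theta^{(0)}, C)\, u(\theta^{(m)} \mid \theta, C)\, d\nu(\theta) \\
&+ u^{(m-1)}(\theta^{(m)} \mid \theta^{(0)}, C)\, r(\theta^{(m)} \mid C) \\
&+ [r(\theta^{(0)} \mid C)]^{m-1}\, u(\theta^{(m)} \mid \theta^{(0)}, C).
\end{aligned}
\]

This describes all $m$-step transitions that involve at least one accepted move.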


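Definition 4.5.1 An invariant kernel for a Markov chain with transition kernel
$u(\theta^* \mid \theta, C)$ is a nonnegative function $k(\theta \mid C)$ with support $\Theta$ that satisfies

\[ \int_{\Theta} k(\theta \mid C) \left[ \int_{A} u(\theta^* \mid \theta, C)\, d\nu(\theta^*) + r(\theta \mid C)\, I_A(\theta) \right] d\nu(\theta) = \int_{A} k(\theta \mid C)\, d\nu(\theta) = K(A \mid C) \]

for all $\nu$-measurable $A$.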


Definition 4.5.2 The transition kernel $u(\theta^* \mid \theta, C)$ is p-irreducible if for all
$\theta^{(0)} \in \Theta$, $K(A \mid C) > 0$ implies that $P(\theta^{(m)} \in A \mid \theta^{(0)}, C) > 0$ for some $m \ge 1$.


Situations like the one shown in Figure 4.2a, where the support is disconnected
and the Markov chain is the Gibbs sampler, cannot arise if $u$ is p-irreducible.
Referring to Figure 4.2a, note that if $\theta^{(0)} \in \Theta_1$, it is impossible that
$\theta^{(m)} \in \Theta_2$ for any $m > 0$. At best there are two invariant distributions, one for $\Theta_1$, reached if
$\theta^{(0)} \in \Theta_1$, and one for $\Theta_2$, reached if $\theta^{(0)} \in \Theta_2$.


Definition 4.5.3 The transition kernel $u(\theta^* \mid \theta, C)$ is aperiodic if there exists
no $\nu$-measurable partition $\Theta = \bigcup_{s=0}^{r-1} \Theta_s$ $(r \ge 2)$ such that for some
$\theta^{(0)} \in \Theta$

\[ P(\theta^{(m)} \in \Theta_{m \bmod r} \mid \theta^{(0)}, C) = 1 \quad \forall\, m. \]

Definition 4.5.4 The transition kernel $u(\theta^* \mid \theta, C)$ is Harris recurrent if for
all $\nu$-measurable $A$ with $\int_{A} k(\theta \mid C)\, d\nu(\theta) > 0$, all $\theta^{(0)} \in \Theta$, and all $L < \infty$

\[ \lim_{M \to \infty} P\left[ \sum_{m=1}^{M} I_A(\theta^{(m)}) \le L \,\Big|\, \theta^{(0)}, C \right] = 0. \tag{4.28} \]

It follows at once that if a transition kernel is Harris recurrent, then it is
p-irreducible.

An invariant kernel $k(\theta \mid C)$ is defined only up to an arbitrary scaling constant.
Recall from Section 3.2 that if a kernel is finitely integrable, then it is proper. In
this case there is a unique probability density $p(\theta \mid C)$ corresponding to $k(\theta \mid C)$.


Definition 4.5.5 If an aperiodic and Harris recurrent transition kernel
$u(\theta^* \mid \theta, C)$ has a proper invariant kernel, then $u$ is ergodic.


Theorem 4.5.1 Convergence of Continuous State Markov Chains Suppose
that $k(\theta \mid C)$ is an invariant kernel of the transition kernel $u(\theta^* \mid \theta, C)$.

(a) If $u$ is p-irreducible, then the invariant kernel is unique up to a scaling
factor.

(b) If $u$ is p-irreducible and aperiodic and the invariant kernel is proper, then
there exists a set $\tilde{\Theta} \subseteq \Theta$ with $\int_{\tilde{\Theta}} p(\theta \mid C)\, d\nu(\theta) = 1$ such that if
$\theta^{(0)} \in \tilde{\Theta}$, then

\[ \lim_{m \to \infty} \int_{\Theta} |u^{(m)}(\theta \mid \theta^{(0)}, C) - p(\theta \mid C)|\, d\nu(\theta) = 0. \tag{4.29} \]

If $u$ is ergodic (i.e., if it is also Harris recurrent), then $\tilde{\Theta} = \Theta$.

(c) If $u$ is ergodic, then for all $\theta^{(0)} \in \Theta$ and all functions $g(\theta)$ such that
$\int_{\Theta} |g(\theta)|\, p(\theta \mid C)\, d\nu(\theta) < \infty$, we obtain

\[ M^{-1} \sum_{m=1}^{M} g(\theta^{(m)}) \xrightarrow{a.s.} \int_{\Theta} g(\theta)\, p(\theta \mid C)\, d\nu(\theta). \tag{4.30} \]

Proof: Conclusions (a) and (b) follow immediately from Theorem 1 and
conclusion (c) from Theorem 3, both in Tierney (1994).


Observe that if the transition kernel $u$ is ergodic, then the invariant distribution
is unique and (4.29) and (4.30) both obtain.

In using simulation methods for Bayesian inference we are concerned with vectors of interest $\omega$, and functions $h(\omega)$, that are not deterministic functions of $\theta$. Theorem 4.5.1 can be extended immediately to include these cases.


Theorem 4.5.2 Convergence of a Vector of Interest with Continuous State Markov Chains Suppose that $\{\theta^{(m)}\}$ is ergodic with invariant density $p(\theta \mid I)$ and $\omega^{(m)} \sim p(\omega \mid \theta^{(m)}, I)$. Then $\{\theta^{(m)}, \omega^{(m)}\}$ is ergodic with invariant density $p(\theta \mid I)\, p(\omega \mid \theta, I)$.


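Proof: Since $\{\theta^{(m)}\}$ is aperiodic, $\{\theta^{(m)}, \omega^{(m)}\}$ must be also. Let $G$ be any subset of $\Theta \times \Omega$ such that $\int_G p(\theta \mid I)\, p(\omega \mid \theta, I)\, d\nu(\omega)\, d\nu(\theta) > 0$, and let $N$ be any positive integer. Let $D = A \times B \subseteq G$, such that $P(\omega \in B \mid \theta, I) > \varepsilon > 0\ \forall\, \theta \in A$. Then

$$\sum_{m=1}^{M} I_G(\theta^{(m)}, \omega^{(m)}) \ge \sum_{m=1}^{M} I_D(\theta^{(m)}, \omega^{(m)})$$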


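but

$$\lim_{M \to \infty} P\left[ \frac{\sum_{m=1}^{M} I_D(\theta^{(m)}, \omega^{(m)})}{\sum_{m=1}^{M} I_A(\theta^{(m)})} \ge \frac{\varepsilon}{2} \,\Big|\, I \right] = 1.$$

Since (4.28) is true for $L = [2N/\varepsilon] + 1$, $\{\theta^{(m)}, \omega^{(m)}\}$ is Harris recurrent. ∎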


An ergodic Markov chain can be used to compute simulation-consistent approx- 
imations of Bayes actions, in exactly the same way as was the case for direct 
sampling (Theorem 4.1.2). 


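Theorem 4.5.3 Approximation of Bayes Actions by MCMC Sampling Suppose that in the Markov chain $C$ the sequence $\{\theta^{(m)}, \omega^{(m)}\}$ is ergodic with invariant density $p(\theta \mid I)\, p(\omega \mid \theta, I)$. Let $L(a, \omega) \ge 0$ be a loss function defined on $A \times \Omega$ and suppose that the risk function

$$R(a) = \int_\Omega \int_\Theta L(a, \omega)\, p(\theta \mid I)\, p(\omega \mid \theta, I)\, d\theta\, d\omega$$

has a strict global minimum at $\hat{a} \in A \subseteq \mathbb{R}^m$. Suppose further that for a suitably defined open neighborhood of $\hat{a}$, $N(\hat{a})$:


1. $M^{-1} \sum_{m=1}^{M} L(a, \omega^{(m)}) \xrightarrow{a.s.} R(a)$ uniformly on $N(\hat{a})$.
2. $\partial L(a, \omega)/\partial a$ exists and is a continuous function of $a$, for all $\omega \in \Omega$ and all $a \in N(\hat{a})$.


Let $A_M$ be the set of all roots of $M^{-1} \sum_{m=1}^{M} \partial L(a, \omega^{(m)})/\partial a = 0$. Then for any $\varepsilon > 0$, $\lim_{M \to \infty} P[\inf_{a \in A_M} (a - \hat{a})'(a - \hat{a}) > \varepsilon \mid C] = 0$.

Proof: The result follows from Amemiya (1985), Theorem 4.1.2. ∎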


Condition 1 can make this result somewhat more awkward to apply than Theorems 4.1.2 or 4.2.3. Although it must be verified for the problem at hand, it can be expected to apply widely. For example, if the invariant density $p(\theta \mid I)\, p(\omega \mid \theta, I)$ has moments at least of order $n$ and $L(a, \omega)$ is a polynomial of order $n$ in $\omega$, then condition 1 is satisfied.

Section 4.3 demonstrated how to construct an MCMC algorithm for which a specified $p(\theta \mid I)$ is an invariant density. In order to use the realization from such an algorithm to provide a simulation-consistent approximation of a moment under $p(\theta \mid I)$, it is necessary to show that the transition density of the algorithm is ergodic. Direct application of Theorem 4.5.1 can be somewhat tedious. For the Gibbs sampler and the Metropolis–Hastings algorithm there are respective sufficient conditions for ergodicity that are easier to verify. These conditions are stronger than those in Theorem 4.5.1, but they are often satisfied in practice.


4.5.1 Convergence of the Gibbs Sampler 


Suppose that a Gibbs sampler is constructed from a specified probability density $p(\theta \mid I)$ as described in Section 4.3.1, producing the transition density $p(\theta^{(m)} \mid \theta^{(m-1)}, G)$ defined in (4.14). If $\theta^{(0)} \in \Theta$, then $p(\theta \mid I)$ is an invariant density of $p(\theta^{(m)} \mid \theta^{(m-1)}, G)$ as shown in Section 4.3.1. It remains to establish that the density of interest $p(\theta \mid I)$ is the unique invariant density of the Markov chain, and the sense in which the chain converges. The following result is immediate and is often easy to apply.


Corollary 4.5.1 A First Sufficient Condition for Convergence of the Gibbs Sampler Suppose that for every point $\theta \in \Theta$ and every $\nu$-measurable $A \subseteq \Theta$

$$\int_A p(\theta \mid I)\, d\nu(\theta) > 0 \;\Longrightarrow\; \int_A p(\theta^* \mid \theta, G)\, d\nu(\theta^*) > 0.$$

Then the transition kernel of the Gibbs sampler $G$ is ergodic.


Example 4.5.1 Convergence of the Gibbs Sampler in the Normal Linear Regression Model Corollary 4.5.1 establishes the ergodicity of the Gibbs sampler in Example 4.3.1. To turn to a common but more difficult variant, consider the special case of the restricted normal linear regression model of Examples 4.2.1 and 4.3.2, in which $S = \{\beta : a_i < \beta_i < w_i\ (i = 1, \ldots, k)\}$, where $-\infty < a_i < w_i < \infty$ $(i = 1, \ldots, k)$. The Gibbs sampler can be used to draw from the posterior distribution, with a full blocking on each $\beta_i$ as well as $h$. Corollary 4.5.1 establishes convergence (but Theorem 4.5.4, below, does not). Section 5.3 treats this model, and some variants on it, in detail.


An alternative to Corollary 4.5.1 is provided by Roberts and Smith (1994).


Theorem 4.5.4 A Second Sufficient Condition for Convergence of the Gibbs Sampler Suppose that $\theta \mid I$ is absolutely continuous and the following three conditions are satisfied:

1. The invariant density $p(\theta \mid I)$ is lower semicontinuous; that is, for all $\theta$ with $p(\theta \mid I) > 0$, there exists an open neighborhood $N_\theta$ of $\theta$ and $\varepsilon > 0$ such that for all $\theta^* \in N_\theta$, $p(\theta^* \mid I) \ge \varepsilon$.

2. For every point $\theta^* \in \Theta$ and each block $b$ of the Gibbs sampler, there exists an open neighborhood $N(\theta^*_{(-b)})$ of $\theta^*_{(-b)}$ and a bounded function $c(\theta_{(-b)})$ such that for all $\theta_{(-b)} \in N(\theta^*_{(-b)})$

$$\int_{\Theta_{(b)}} p(\theta_{(b)}, \theta_{(-b)} \mid I)\, d\theta_{(b)} \le c(\theta_{(-b)}).$$


3. $\Theta$ is connected.

Then the transition kernel of the Gibbs sampler is ergodic.

Proof: See Theorem 2 of Roberts and Smith (1994). ∎


Theorem 4.5.4 rules out situations like the one shown in Figure 4.2b, where the support of the posterior density is a closed set. For any point $\theta$ on the boundary there is no open neighborhood $N_\theta$ such that for all $\theta^* \in N_\theta$, $p(\theta^* \mid I)$ is bounded away from 0.


Example 4.5.2 Convergence of the Gibbs Sampler in a Normal Linear Regression Model with Weak Inequality Constraints Consider the normal linear regression model

$$y_t \sim N(\beta_1 + \beta_2 x_{t2} + \beta_3 x_{t3},\; h^{-1}) \quad (t = 1, \ldots, T),$$
$$\beta = (\beta_1, \beta_2, \beta_3)' \sim N(\underline{\beta}, \underline{H}^{-1}), \qquad \underline{s}^2 h \sim \chi^2(\underline{\nu}),$$

subject to the constraint $0 < \beta_2 + \beta_3 < 1$. If a Gibbs sampler with the four blocks $\beta_1$, $\beta_2$, $\beta_3$, $h$ is used, then Theorem 4.5.4 establishes ergodicity but Corollary 4.5.1 does not. On the other hand, suppose that the model is recast as

$$y_t \sim N(\gamma_1 + \gamma_2 x_{t2} + \gamma_3 (x_{t3} - x_{t2}),\; h^{-1}) \quad (t = 1, \ldots, T),$$
$$\gamma = (\gamma_1, \gamma_2, \gamma_3)' \sim N(A\underline{\beta}, A\underline{H}^{-1}A'), \qquad \underline{s}^2 h \sim \chi^2(\underline{\nu}),$$

subject to the constraint $0 < \gamma_2 < 1$, where

$$A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}.$$

Then a Gibbs sampler with the three blocks $(\gamma_1, \gamma_3)$, $\gamma_2$, and $h$ can be used, and either Corollary 4.5.1 or Theorem 4.5.4 establishes ergodicity. This is a particular case of linear inequality constraints in the normal linear regression model taken up in Section 5.3.

Tierney (1994) discusses weaker conditions for convergence of the Gibbs sampler. However, the conditions stated here are satisfied for a wide range of problems in econometrics and statistics and are substantially easier to verify.


4.5.2 Convergence of the Metropolis–Hastings Algorithm


Tierney (1994) and Roberts and Smith (1994) show that the convergence properties of the Metropolis–Hastings algorithm are inherited from those of $q(\theta^* \mid \theta, H)$; if $q$ is aperiodic and p-irreducible, then so is the Metropolis–Hastings algorithm. This feature leads to a sufficient condition for convergence analogous to Corollary 4.5.1.


Theorem 4.5.5 A First Sufficient Condition for Convergence of the Metropolis–Hastings Algorithm Suppose that for every point $\theta \in \Theta$ and every $A \subseteq \Theta$ with the property $\int_A p(\theta \mid I)\, d\nu(\theta) > 0$, it is the case that

$$\int_A q(\theta^* \mid \theta, H)\, d\nu(\theta^*) > 0.$$

Then the transition kernel of the Metropolis–Hastings algorithm is ergodic.

Proof: See Tierney (1994), Corollary 2. ∎


The condition in Theorem 4.5.5 may be restated as requiring that if $\theta$ is in the support of $p(\theta \mid I)$, then the support of $q(\theta^* \mid \theta, H)$ includes the support of $p(\theta \mid I)$. This condition is satisfied for many Metropolis–Hastings algorithms, which are therefore ergodic.


Example 4.5.3 Some Generically Ergodic Metropolis–Hastings Algorithms Example 4.3.2 introduced the specific case of a random-walk Metropolis–Hastings chain in which $q(\theta^* \mid \theta, H)$ is the $N(\theta, V)$ density. Since the support of this density is $\mathbb{R}^k$, this algorithm satisfies the condition in Theorem 4.5.5. Any Metropolis independence chain $q(\theta^* \mid \theta, H) = q(\theta^* \mid H)$ in which the support of $q$ includes the support of $p(\theta \mid I)$ is also ergodic. As should be clear from previous discussion (Example 4.3.2 in the former case and Section 4.2 in the latter), this condition provides no assurance that the algorithm is sufficiently efficient to be practical. Further analytical work, trial computations, or both are needed to provide a form of the algorithm that is practical in each case.
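
As a rough illustration of the random-walk case in Example 4.5.3, the sketch below runs a one-dimensional random-walk Metropolis chain. The target (an equal mixture of N(-2, 1) and N(2, 1)), the step size, the starting value, and all function names are assumptions made for this illustration; they are not taken from the text. Because the normal candidate density has support on the whole real line, the support condition of Theorem 4.5.5 holds for this hypothetical target.

```python
import math
import random


def log_target(theta):
    """Log kernel of a hypothetical target: equal mixture of N(-2, 1) and N(2, 1)."""
    a = -0.5 * (theta + 2.0) ** 2
    b = -0.5 * (theta - 2.0) ** 2
    top = max(a, b)
    return top + math.log(math.exp(a - top) + math.exp(b - top))


def random_walk_metropolis(log_p, theta0, step_sd, n_iter, rng):
    """Random-walk Metropolis chain with N(theta, step_sd**2) candidates."""
    theta = theta0
    draws = []
    n_accept = 0
    for _ in range(n_iter):
        candidate = theta + step_sd * rng.gauss(0.0, 1.0)
        # Symmetric proposal: q cancels, leaving the ratio of target kernels.
        log_ratio = log_p(candidate) - log_p(theta)
        if math.log(1.0 - rng.random()) < log_ratio:
            theta = candidate
            n_accept += 1
        draws.append(theta)
    return draws, n_accept / n_iter


rng = random.Random(0)
draws, accept_rate = random_walk_metropolis(log_target, 8.0, 2.5, 20000, rng)
kept = draws[2000:]  # discard early draws influenced by the remote start
posterior_mean = sum(kept) / len(kept)
```

Ergodicity says nothing about efficiency: rerunning the sketch with a much smaller or much larger `step_sd` leaves the chain ergodic but can make the approximation of the mean far less accurate for the same number of iterations.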


A complementary sufficient condition for convergence of Metropolis–Hastings chains is provided by the following result, which is analogous to Theorem 4.5.4 for the Gibbs sampler.


Theorem 4.5.6 A Second Sufficient Condition for Convergence of the Metropolis–Hastings Algorithm Suppose that for all pairs $(\theta, \theta^*) \in \Theta \times \Theta$, $p(\theta \mid I)$ and $q(\theta^* \mid \theta, H)$ are positive and continuous. Then the Metropolis–Hastings transition kernel is ergodic.

Proof: See Mengersen and Tweedie (1996), Lemmas 1.1 and 1.2. ∎


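Exercise 4.5.1 Simulation from the Bivariate Normal Distribution (This is a continuation of Exercise 4.1.2.) Suppose $(x, y)' \sim N(\mu, \Sigma)$ with

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}.$$

Let $\varepsilon_j^{(m)}$ $(j = 1, 2;\ m = 1, 2, \ldots)$ denote mutually independent, standard normal random variables. The function of interest is

$$f(x, y) = \begin{cases} (x^2 + y^2)^{1/2} & \text{if } x^2 + y^2 < 1, \\ 1 & \text{if } x^2 + y^2 \ge 1. \end{cases}$$


(a) Consider the simulation

$$x^{(m)} = \mu_1 + (\sigma_{12}/\sigma_{22})(y^{(m-1)} - \mu_2) + (\sigma_{11} - \sigma_{12}^2/\sigma_{22})^{1/2}\, \varepsilon_1^{(m)},$$

$$y^{(m)} = \mu_2 + (\sigma_{12}/\sigma_{11})(x^{(m)} - \mu_1) + (\sigma_{22} - \sigma_{12}^2/\sigma_{11})^{1/2}\, \varepsilon_2^{(m)},$$

with $y^{(0)} \sim N(\mu_2, \sigma_{22})$. Show that

$$M^{-1} \sum_{m=1}^{M} f(x^{(m)}, y^{(m)}) \xrightarrow{a.s.} E[f(x, y)].$$


(b) Derive the correlation coefficient between $x^{(m)}$ and $x^{(m+1)}$ in (a). [Recall that in the algorithm in Exercise 4.1.2, $x^{(m)}$ and $x^{(m+1)}$ are uncorrelated.]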


Exercise 4.5.2 Convergence of MCMC with an Inequality-Constrained 
Support Return to the algorithms you constructed in Exercise 4.3.3. 


(a) Show that the Markov chain is ergodic, where the posterior is the unique 
invariant distribution in each case. 

(b) Suppose that the inequality restrictions in parts (b) and (c) of Exercise 4.3.3 
were weak rather than strong. Modify the Markov chain appropriately, and 
show that it is ergodic. 


Exercise 4.5.3 Identification, Proper Posteriors, and MCMC This problem asks you to carry out some exercises with a simple normal model. The main lesson to be drawn is in part (e): MCMC algorithms can be constructed in models where the posterior distribution does not even exist, if one is careless. Part (b) illustrates how a model that is unidentified in the conventional sense that $p(y \mid \theta_{A1}, A) = p(y \mid \theta_{A2}, A)$ for $\theta_{A1} \ne \theta_{A2}$ can still have a posterior distribution for all parameters, given a proper prior. [For more on identification in a Bayesian context, see Poirier (1998).] If you have had exposure to the ideas of integration and cointegration in time series, then the sense in which an invariant distribution does not exist in part (e) should be quite clear.

(a) Suppose $y_t \mid (\mu, A) \overset{iid}{\sim} N(\mu, 1)$ $(t = 1, \ldots, T)$ and $\mu \mid A \sim N(0, \underline{h}^{-1})$. Derive $\mu \mid (y^o, A)$.

(b) Now consider the model

$$\mu \mid B \sim N(0, \underline{h}^{-1} I_2), \qquad y_t \mid (\mu, B) \overset{iid}{\sim} N(\mu_1 + \mu_2, 1).$$

Derive $\mu \mid (y^o, B)$.


(c) Consider the model $y_t \mid (\mu, C) \overset{iid}{\sim} N(\mu_1 + \mu_2, 1)$ with an improper prior distribution $p(\mu \mid C) \propto 1\ \forall\, \mu \in \mathbb{R}^2$. Show that the posterior distribution is improper.

(d) Derive the distributions $\mu_1 \mid (y^o, \mu_2, C)$ and $\mu_2 \mid (y^o, \mu_1, C)$. Note that each is proper.

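(e) Construct a Gibbs sampling algorithm based on your work in (d):

$$\mu_1^{(m)} \sim p(\mu_1 \mid y^o, \mu_2^{(m-1)}, C),$$

$$\mu_2^{(m)} \sim p(\mu_2 \mid y^o, \mu_1^{(m)}, C).$$

Show that

(i) $\mu_1^{(m)} = \mu_1^{(m-1)} + \zeta_m$, where $\zeta_m \overset{iid}{\sim} N(0, 2T^{-1})$.

(ii) $\mu_1^{(m)} + \mu_2^{(m)} \sim N(\bar{y}, T^{-1})$.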
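
A numerical check can accompany the analytical work in part (e). The sketch below simulates the Gibbs sampler under the flat prior, assuming the conditional distributions are $\mu_1 \mid (y^o, \mu_2) \sim N(\bar{y} - \mu_2, T^{-1})$ and symmetrically for $\mu_2$; that form is the assumption on which the code rests and should be verified in part (d). The sample size, sample mean, starting values, and function names are hypothetical.

```python
import random


def improper_posterior_gibbs(ybar, T, n_iter, rng):
    """Gibbs sampler for (mu1, mu2) when y_t ~ N(mu1 + mu2, 1) and the prior is flat."""
    sd = (1.0 / T) ** 0.5
    mu1, mu2 = 0.0, 0.0  # arbitrary starting values
    path = []
    for _ in range(n_iter):
        mu1 = (ybar - mu2) + sd * rng.gauss(0.0, 1.0)  # draw mu1 given mu2
        mu2 = (ybar - mu1) + sd * rng.gauss(0.0, 1.0)  # draw mu2 given the new mu1
        path.append((mu1, mu2))
    return path


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(1)
T, ybar = 4, 1.5  # hypothetical sample size and sample mean
path = improper_posterior_gibbs(ybar, T, 20000, rng)
increments = [path[m][0] - path[m - 1][0] for m in range(1, len(path))]
sums = [mu1 + mu2 for mu1, mu2 in path]
increment_variance = sample_variance(increments)  # compare with 2 / T
sum_mean = sum(sums) / len(sums)                  # compare with ybar
sum_variance = sample_variance(sums)              # compare with 1 / T
```

The sum $\mu_1 + \mu_2$ behaves like draws from a proper distribution, while $\mu_1$ alone wanders as a random walk, so the sampler runs without complaint even though no invariant distribution exists.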


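Exercise 4.5.4 Economic Decisionmaking A stochastic production relation is modeled as

$$y_t = \beta_1 + \beta_2 x_{t1} + \beta_3 x_{t2} + \beta_4 x_{t1}^2 + \beta_5 x_{t2}^2 + \beta_6 x_{t1} x_{t2} + \varepsilon_t, \tag{4.31}$$
$$\varepsilon_t \overset{iid}{\sim} N(0, h^{-1}),$$

where $y_t$ is output and $x_{t1}$ and $x_{t2}$ are the two inputs treated as ancillary statistics in the model. The prior distribution has two independent components:

$$\beta = (\beta_1, \beta_2, \beta_3, \beta_4, \beta_5, \beta_6)' \sim N(\underline{\beta}, \underline{H}^{-1}), \tag{4.32}$$
$$\underline{s}^2 h \sim \chi^2(\underline{\nu}).$$

The distribution in (4.32) is subject to the further restrictions that

$$E(y \mid x) = \beta_1 + \beta_2 x_1 + \beta_3 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \beta_6 x_1 x_2$$

is a strictly concave function for all $x = (x_1, x_2)' \ge 0$, and that there exists $x^* \ge 0$ such that $E(y \mid x^*) \ge E(y \mid x)$ for all $x \ge 0$.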


(a) Carefully express the posterior density kernel for this model.

(b) We could use any one of several posterior simulators in this model. Describe two different posterior simulators. What considerations will be important in determining which is more efficient? (A good answer will have nontrivial differences in the two simulators, plus a substantial discussion of efficiency.)

(c) Suppose that you have completed (a) and (b) and have posterior simulator output $\{\beta^{(m)}, h^{(m)}\}$ $(m = 1, \ldots, M)$ using all the data for periods 1 through $T$. The manager of the firm using the production relation (4.31) knows output price $p_{T+1}$ and input prices $r_{T+1,1}$ and $r_{T+1,2}$ for $x_{T+1,1}$ and $x_{T+1,2}$, respectively, in period $T + 1$. He must choose $x_{T+1,1}$ and $x_{T+1,2}$ before observing $\varepsilon_{T+1}$. (Inputs cannot be negative.) The manager's objective is to choose $x_{T+1,1}$ and $x_{T+1,2}$ so as to maximize

$$E\{U(\pi_{T+1}) \mid [(y_t, x_{t1}, x_{t2})\ (t = 1, \ldots, T),\ p_{T+1},\ r_{T+1,1},\ r_{T+1,2}]\}$$

where $\pi_{T+1} = p_{T+1} y_{T+1} - r_{T+1,1} x_{T+1,1} - r_{T+1,2} x_{T+1,2}$ and $U(\cdot)$ is a monotone increasing, strictly concave function with first and second derivatives that are easy to compute. Indicate how you would solve the manager's problem using the output of the posterior simulator described in (b).


4.6 HYBRID MARKOV CHAIN MONTE CARLO METHODS 


The utility of Monte Carlo methods in Bayesian inference stems in great part from combinations of algorithms. Example 4.2.4 showed that a hybrid importance and acceptance sampling algorithm could be more efficient than either algorithm alone. The addition of MCMC algorithms widens the scope for combination. These hybrid algorithms do more than increase efficiency: they can provide elegant yet practical solutions of difficult problems in the construction of posterior simulators. This section examines two such hybrid algorithms.


4.6.1 Transition Mixtures 


In the context of the Metropolis–Hastings algorithm, suppose that there are $J$ different transition probability densities $q(\theta^* \mid \theta, H_j)$ that might be used. A transition mixture chooses randomly between the $J$ densities $q(\theta^* \mid \theta, H_j)$ with respective choice probabilities $\pi_j$ assigned to the densities. The probabilities $\pi_j$ are constant and do not depend on $\theta$:

$$q(\theta^* \mid \theta, H) = \sum_{j=1}^{J} \pi_j\, q(\theta^* \mid \theta, H_j).$$

Once density $j$ is selected, a candidate $\theta^*$ is drawn from $q(\theta^* \mid \theta, H_j)$, and is accepted with probability

$$\alpha(\theta^* \mid \theta, H_j) = \min\left\{ \frac{p(\theta^* \mid I)/q(\theta^* \mid \theta, H_j)}{p(\theta \mid I)/q(\theta \mid \theta^*, H_j)},\ 1 \right\}. \tag{4.33}$$

Note that only the chosen transition density enters (4.33): the other densities and the choice probabilities $\pi_j$ are irrelevant to acceptance or rejection of the candidate once the transition density has been selected.

To see that $p(\theta \mid I)$ is an invariant density of the transition mixture, note that the reversibility condition is

$$p(\theta \mid I) \sum_{j=1}^{J} \pi_j\, q(\theta^* \mid \theta, H_j)\, \alpha(\theta^* \mid \theta, H_j) = p(\theta^* \mid I) \sum_{j=1}^{J} \pi_j\, q(\theta \mid \theta^*, H_j)\, \alpha(\theta \mid \theta^*, H_j).$$

This condition holds if

$$p(\theta \mid I)\, q(\theta^* \mid \theta, H_j)\, \alpha(\theta^* \mid \theta, H_j) = p(\theta^* \mid I)\, q(\theta \mid \theta^*, H_j)\, \alpha(\theta \mid \theta^*, H_j) \quad (j = 1, \ldots, J). \tag{4.34}$$

Condition (4.34) leads to (4.33), just as (4.21) led to (4.15).

Transition mixtures can be powerful tools in building posterior simulators that are ergodic and robust to ill-behaved posterior distributions. To see how ergodicity arises, note that if the support of at least one $q(\theta^* \mid \theta, H_j)$ includes the support of $p(\theta \mid I)$, then the same is true of the transition mixture. Theorem 4.5.5 then implies that the transition mixture kernel is ergodic. [Tierney (1994) shows that it is sufficient that just one of the transition kernels be ergodic.]
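
The transition mixture can be sketched in a few lines. In the illustration below the mixture has $J = 2$ components: a random-walk density with a small step, and an independence density. The bimodal target, the choice probabilities, the two scale parameters, and all function names are assumptions chosen for the example. As in (4.33), only the selected component enters the acceptance probability.

```python
import math
import random


def log_target(theta):
    """Log kernel of a hypothetical bimodal target: mixture of N(-4, 1) and N(4, 1)."""
    a = -0.5 * (theta + 4.0) ** 2
    b = -0.5 * (theta - 4.0) ** 2
    top = max(a, b)
    return top + math.log(math.exp(a - top) + math.exp(b - top))


def log_normal_kernel(x, sd):
    """Log N(0, sd**2) density up to a constant that cancels in the acceptance ratio."""
    return -0.5 * (x / sd) ** 2


def transition_mixture_step(theta, rng, pi_rw=0.5, rw_sd=0.5, indep_sd=5.0):
    """One step: choose a component with fixed probabilities, then apply (4.33) to it."""
    if rng.random() < pi_rw:
        # Component 1: random walk, symmetric, so q cancels in the ratio.
        candidate = theta + rw_sd * rng.gauss(0.0, 1.0)
        log_ratio = log_target(candidate) - log_target(theta)
    else:
        # Component 2: independence chain with an N(0, indep_sd**2) candidate density.
        candidate = indep_sd * rng.gauss(0.0, 1.0)
        log_ratio = (log_target(candidate) - log_normal_kernel(candidate, indep_sd)) - (
            log_target(theta) - log_normal_kernel(theta, indep_sd))
    if math.log(1.0 - rng.random()) < log_ratio:
        return candidate
    return theta


rng = random.Random(2)
theta = -4.0
draws = []
for _ in range(20000):
    theta = transition_mixture_step(theta, rng)
    draws.append(theta)
share_positive = sum(1 for d in draws if d > 0.0) / len(draws)
```

With the small random-walk step alone, a chain started near one mode of this target would rarely reach the other; the independence component supplies the occasional long moves, which is the kind of robustness the mixture is meant to provide.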


4.6.2 Metropolis within Gibbs 


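Suppose that in attempting to implement a Gibbs sampling algorithm, a conditional density $p(\theta_{(b)} \mid \theta_{(-b)}, I)$ is intractable. The density is not of any known form, and efficient acceptance sampling algorithms are not at hand. This problem can be addressed by applying the Metropolis–Hastings algorithm in block $b$ of the Gibbs sampler while treating the other blocks in the usual way. Specifically, let $q(\theta^*_{(b)} \mid \theta, H_b)$ be the density (indexed by $\theta$) from which the candidate $\theta^*_{(b)}$ is drawn. At iteration $m$, block $b$, of the Gibbs sampler draw $\theta^*_{(b)} \sim q(\theta^*_{(b)} \mid \theta^{(m)}_{(<b)}, \theta^{(m-1)}_{(\ge b)}, H_b)$, and set $\theta^{(m)}_{(b)} = \theta^*_{(b)}$ with probability

$$\alpha(\theta^*_{(b)} \mid \theta^{(m)}_{(<b)}, \theta^{(m-1)}_{(\ge b)}, H_b) = \min\left\{ \frac{p(\theta^{(m)}_{(<b)}, \theta^*_{(b)}, \theta^{(m-1)}_{(>b)} \mid I) \big/ q(\theta^*_{(b)} \mid \theta^{(m)}_{(<b)}, \theta^{(m-1)}_{(\ge b)}, H_b)}{p(\theta^{(m)}_{(<b)}, \theta^{(m-1)}_{(\ge b)} \mid I) \big/ q(\theta^{(m-1)}_{(b)} \mid \theta^{(m)}_{(<b)}, \theta^*_{(b)}, \theta^{(m-1)}_{(>b)}, H_b)},\ 1 \right\}.$$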

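If $\theta^{(m)}_{(b)}$ is not set to $\theta^*_{(b)}$, then $\theta^{(m)}_{(b)} = \theta^{(m-1)}_{(b)}$. The procedure for $\theta_{(b)}$ is exactly the same as for a standard Metropolis step, except that $\theta_{(-b)}$ also enters the density $p$ and transition density $q$. It is usually called a Metropolis within Gibbs step.

To see that $p(\theta \mid I)$ is an invariant density of this Markov chain, consider the simple case of two blocks with a Metropolis within Gibbs step in the second block. Adapting the notation of (4.19), describe the Metropolis step for the second block by

$$p(\theta^*_{(2)} \mid \theta^*_{(1)}, \theta_{(2)}, H_2) = u(\theta^*_{(2)} \mid \theta^*_{(1)}, \theta_{(2)}, H_2) + r(\theta_{(2)} \mid \theta^*_{(1)}, H_2)\, \delta_{\theta_{(2)}}(\theta^*_{(2)})$$

where

$$u(\theta^*_{(2)} \mid \theta^*_{(1)}, \theta_{(2)}, H_2) = \alpha(\theta^*_{(2)} \mid \theta^*_{(1)}, \theta_{(2)}, H_2)\, q(\theta^*_{(2)} \mid \theta^*_{(1)}, \theta_{(2)}, H_2)$$

and

$$r(\theta_{(2)} \mid \theta^*_{(1)}, H_2) = 1 - \int_{\Theta_2} u(\theta^*_{(2)} \mid \theta^*_{(1)}, \theta_{(2)}, H_2)\, d\nu(\theta^*_{(2)}). \tag{4.35}$$

The one-step transition density for the entire chain is

$$p(\theta^* \mid \theta, G) = p(\theta^*_{(1)} \mid \theta_{(2)}, I)\, p(\theta^*_{(2)} \mid \theta^*_{(1)}, \theta_{(2)}, H_2).$$

Then $p(\theta \mid I)$ is an invariant density of $p(\theta^* \mid \theta, G)$ if

$$\int_\Theta p(\theta \mid I)\, p(\theta^* \mid \theta, G)\, d\nu(\theta) = p(\theta^* \mid I). \tag{4.36}$$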
© 
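To establish (4.36), begin by expanding the left side:

$$\int_\Theta p(\theta \mid I)\, p(\theta^* \mid \theta, G)\, d\nu(\theta) = \int_{\Theta_2} \int_{\Theta_1} p(\theta_{(1)}, \theta_{(2)} \mid I)\, d\nu(\theta_{(1)})\; p(\theta^*_{(1)} \mid \theta_{(2)}, I) \left[ u(\theta^*_{(2)} \mid \theta^*_{(1)}, \theta_{(2)}, H_2) + r(\theta_{(2)} \mid \theta^*_{(1)}, H_2)\, \delta_{\theta_{(2)}}(\theta^*_{(2)}) \right] d\nu(\theta_{(2)})$$

$$= \int_{\Theta_2} p(\theta_{(2)} \mid I)\, p(\theta^*_{(1)} \mid \theta_{(2)}, I)\, u(\theta^*_{(2)} \mid \theta^*_{(1)}, \theta_{(2)}, H_2)\, d\nu(\theta_{(2)}) \tag{4.37}$$

$$+ \int_{\Theta_2} p(\theta_{(2)} \mid I)\, p(\theta^*_{(1)} \mid \theta_{(2)}, I)\, r(\theta_{(2)} \mid \theta^*_{(1)}, H_2)\, \delta_{\theta_{(2)}}(\theta^*_{(2)})\, d\nu(\theta_{(2)}). \tag{4.38}$$

In (4.37) and (4.38) we have utilized the fact that

$$p(\theta_{(2)} \mid I) = \int_{\Theta_1} p(\theta_{(1)}, \theta_{(2)} \mid I)\, d\nu(\theta_{(1)}).$$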
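Using Bayes rule, (4.37) is the same as

$$p(\theta^*_{(1)} \mid I) \int_{\Theta_2} p(\theta_{(2)} \mid \theta^*_{(1)}, I)\, u(\theta^*_{(2)} \mid \theta^*_{(1)}, \theta_{(2)}, H_2)\, d\nu(\theta_{(2)}). \tag{4.39}$$

Carrying out the integration in (4.38) yields

$$p(\theta^*_{(2)} \mid I)\, p(\theta^*_{(1)} \mid \theta^*_{(2)}, I)\, r(\theta^*_{(2)} \mid \theta^*_{(1)}, H_2). \tag{4.40}$$

Recalling the reversibility of the Metropolis step, we obtain

$$p(\theta_{(2)} \mid \theta^*_{(1)}, I)\, u(\theta^*_{(2)} \mid \theta^*_{(1)}, \theta_{(2)}, H_2) = p(\theta^*_{(2)} \mid \theta^*_{(1)}, I)\, u(\theta_{(2)} \mid \theta^*_{(1)}, \theta^*_{(2)}, H_2),$$

and so (4.39) becomes

$$p(\theta^*_{(1)} \mid I)\, p(\theta^*_{(2)} \mid \theta^*_{(1)}, I) \int_{\Theta_2} u(\theta_{(2)} \mid \theta^*_{(1)}, \theta^*_{(2)}, H_2)\, d\nu(\theta_{(2)}). \tag{4.41}$$

We can express (4.40) as

$$p(\theta^*_{(1)}, \theta^*_{(2)} \mid I)\, r(\theta^*_{(2)} \mid \theta^*_{(1)}, H_2). \tag{4.42}$$

Finally, recalling (4.35), the sum of (4.41) and (4.42) is $p(\theta^*_{(1)}, \theta^*_{(2)} \mid I)$, thus establishing (4.36).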

This demonstration of invariance applies to the Gibbs sampler with $b$ blocks, with a Metropolis within Gibbs step for one block, simply through the convention that Metropolis within Gibbs is used in the last block of each iteration. Metropolis within Gibbs steps can be used for several blocks, as well. The argument for invariance proceeds by mathematical induction, and the details are the same. Ergodicity can usually be established in the same way as for Gibbs samplers in general; Corollary 4.5.1 often applies. Section 7.1 provides a specific example of the Metropolis within Gibbs algorithm.
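
A minimal two-block sketch of a Metropolis within Gibbs step follows. To keep the result checkable, the assumed target is a standard bivariate normal with correlation 0.5, for which both conditionals are in fact tractable; the code nevertheless treats the second block as if its conditional were unavailable and updates it with a random-walk Metropolis step. The target, the step size, and the function names are assumptions made for the illustration.

```python
import math
import random


def log_joint(t1, t2, rho):
    """Log kernel of a standard bivariate normal with correlation rho (assumed target)."""
    return -(t1 * t1 - 2.0 * rho * t1 * t2 + t2 * t2) / (2.0 * (1.0 - rho * rho))


def metropolis_within_gibbs(n_iter, rho, step_sd, rng):
    t1, t2 = 0.0, 0.0
    cond_sd = math.sqrt(1.0 - rho * rho)
    draws = []
    for _ in range(n_iter):
        # Block 1: direct draw from the conditional of t1 given t2.
        t1 = rho * t2 + cond_sd * rng.gauss(0.0, 1.0)
        # Block 2: Metropolis step; the new t1 enters the target kernel.
        candidate = t2 + step_sd * rng.gauss(0.0, 1.0)
        log_ratio = log_joint(t1, candidate, rho) - log_joint(t1, t2, rho)
        if math.log(1.0 - rng.random()) < log_ratio:
            t2 = candidate  # otherwise t2 keeps its previous value
        draws.append((t1, t2))
    return draws


def correlation(pairs):
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    s11 = sum((p[0] - m1) ** 2 for p in pairs)
    s22 = sum((p[1] - m2) ** 2 for p in pairs)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in pairs)
    return s12 / math.sqrt(s11 * s22)


rng = random.Random(3)
draws = metropolis_within_gibbs(30000, 0.5, 1.0, rng)
t2_values = [d[1] for d in draws]
t2_mean = sum(t2_values) / len(t2_values)
t2_var = sum((v - t2_mean) ** 2 for v in t2_values) / (len(t2_values) - 1)
sample_corr = correlation(draws)
```

Placing the Metropolis step in the second (last) block matches the convention used in the invariance argument above.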


4.7 NUMERICAL ACCURACY AND CONVERGENCE IN MARKOV 
CHAIN MONTE CARLO 


In any practical application we are concerned with the discrepancy between a posterior moment $\bar{h}$ and its numerical approximation from a posterior simulator with $M$ iterations, $\bar{h}^{(M)}$. If the sequence $\{h(\omega^{(m)})\}$ were i.i.d., this discrepancy could be evaluated by means of a conventional central limit theorem and the resulting numerical standard error (NSE, defined on p. 107), and the efficiency of the algorithm could be assessed using the estimated relative numerical efficiency (RNE, defined on p. 117). Serial correlation in $\{h(\omega^{(m)})\}$ is inherent in Markov chain Monte Carlo, however, and so the numerical accuracy of the approximation $\bar{h}^{(M)}$ must be evaluated afresh. The serial dependence in MCMC algorithms also raises the prospect that if the initial value $\theta^{(0)}$ is remote from the posterior distribution, then, although $\bar{h}^{(M)} \xrightarrow{a.s.} \bar{h}$, early values of $h(\omega^{(m)})$ may be atypical and approximations would be improved by discarding these early values. The issue here is what constitutes "early" and eliminating the possibility that $M$ is still "early" in the sequence. This issue is often referred to as "convergence." There is a substantial literature on the practical aspects of these issues. Key works include Gelman and Rubin (1992), Geweke (1992), Geyer (1992), Cowles and Carlin (1996), Gelman (1996), Brooks and Gelman (1998), and Brooks and Roberts (1998).

To illustrate the issues, consider the Gibbs sampling algorithm for a normal distribution of the random variables $(x, y)$, one block for $x$ and one for $y$. (See Exercise 4.5.1.) Figure 4.3 illustrates the first 400 iterations from two distributions, each with mean zero and unit variances for $x$ and $y$. In the first distribution [panels (a) and (b)] the correlation between $x$ and $y$ is $\rho = 0.90$ and in the second distribution [panels (c) and (d)] it is $\rho = 0.99$. Highest-density regions of size 0.2, 0.4, 0.6, 0.8, and 0.95 are indicated in panels (a) and (c). Panels (a) and (c) show $(x, y)$ values from every other iteration, with the first 100 iterations indicated by a cross and the last 300 by a solid point. Panels (b) and (d) show the values of $x$ in each iteration. In each case the starting value is $x = y = 8$. Values this far from the mean are improbable for either distribution, and this will often be the case in research applications. The values chosen here represent that situation, and illustrate the implications for convergence. In the first case, $\rho = 0.90$, values of $x$ and $y$ have become representative of the bivariate distribution well before the 100th iteration. In the second case, $\rho = 0.99$, all the first 100 iterates are in the positive orthant, and in the next 300 draws there are more iterates in the positive than the negative orthant. This comparison illustrates the point that convergence questions arise from the serial dependence in the Markov chain: the greater the serial dependence, the longer it will take for the chain to become representative of the invariant distribution, other things the same.

Figure 4.3. Output of a Gibbs sampler for a bivariate normal $(x, y)$, blocked in $x$ and $y$: $\rho = 0.9$, every second of 400 iterations (a) and $x$ values of all iterations (b); $\rho = 0.99$, every second of 400 iterations (c) and $x$ values of all iterations (d).
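
The comparison can be reproduced approximately with a short simulation. The sketch below runs the two-block Gibbs sampler for a standard bivariate normal from the starting value $x = y = 8$ and reports the first-order sample autocorrelation of the $x$ draws for each value of $\rho$. It uses far more than 400 iterations so that the autocorrelation estimate is stable; the iteration count, seed, and function names are assumptions of this illustration, not part of the experiment described above.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, start, rng):
    """Two-block Gibbs sampler for a standard bivariate normal with correlation rho."""
    cond_sd = math.sqrt(1.0 - rho * rho)
    x, y = start
    xs = []
    for _ in range(n_iter):
        x = rho * y + cond_sd * rng.gauss(0.0, 1.0)  # x given y
        y = rho * x + cond_sd * rng.gauss(0.0, 1.0)  # y given the new x
        xs.append(x)
    return xs


def lag1_autocorrelation(values):
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    c0 = sum(d * d for d in dev) / n
    c1 = sum(dev[m] * dev[m - 1] for m in range(1, n)) / n
    return c1 / c0


rng = random.Random(4)
xs_90 = gibbs_bivariate_normal(0.90, 20000, (8.0, 8.0), rng)
xs_99 = gibbs_bivariate_normal(0.99, 20000, (8.0, 8.0), rng)
ac_90 = lag1_autocorrelation(xs_90)
ac_99 = lag1_autocorrelation(xs_99)
```

The much stronger serial dependence when $\rho = 0.99$ is what keeps the early iterates in the positive orthant for so long after the remote start.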

We will proceed by first developing some tools for computing NSE and RNE, under the assumption that the sequence $\{\theta^{(m)}, \omega^{(m)}\}$ is stationary. Since the initial simulation $\theta^{(0)}$ is not drawn from the posterior distribution and the sequence is serially dependent, this is not literally true. However, the analysis of the stationary case leads to some analytical tools that are useful in addressing the convergence question. The foundation for evaluating numerical accuracy is a central limit theorem for continuous state space Markov chains.


Definition 4.7.1 Suppose that a Markov chain $C$ has $n$-step transition probability $P^n(A \mid \theta, C) = P(\theta^{(n)} \in A \mid \theta^{(0)} = \theta, C)$, defined on all $\nu$-measurable sets $A$, and unique invariant probability $P(A \mid C)$. The Markov chain is uniformly ergodic if

$$\sup_{\theta \in \Theta} \left[ \sup_A \left| P^n(A \mid \theta, C) - P(A \mid C) \right| \right] \le L r^n \tag{4.43}$$

for some $L > 0$ and some positive $r < 1$.


Tierney (1994, p. 1714) derives two results that are useful in establishing uniform ergodicity. First, an independence Metropolis chain with bounded weight function $w(\theta) = p(\theta \mid I)/p(\theta \mid H)$ is uniformly ergodic. (Recalling the similarity between the independence Metropolis kernel and importance sampling, and the discussion about bounded weight functions following Theorem 4.2.2, this result is not surprising.) Second, if one kernel in a transition mixture (Section 4.6.1) is uniformly ergodic, then the mixture kernel itself is uniformly ergodic. Thus for any Markov chain $\{\theta^{(m)}\}$ we could in principle guarantee (4.43) by mixing the chain with an independence Metropolis kernel with a bounded weight function, as long as the posterior mean and variance were known to exist. If the likelihood function is bounded, then the prior distribution itself will provide such an independence transition kernel. The practical difficulty with this approach is that in most problems draws from the prior will rarely be accepted, and it is often difficult to find an independence kernel that overcomes this difficulty. Thus attention typically focuses directly on the unmixed MCMC algorithm.


Theorem 4.7.1 A Central Limit Theorem for MCMC Approximation of Moments Suppose that $\{\theta^{(m)}\}$ is uniformly ergodic with unique invariant density $p(\theta \mid I)$. Let $\omega^{(m)} \sim p(\omega \mid \theta^{(m)}, I)$, suppose that $E[h(\omega) \mid I] = \bar{h}$ and $\operatorname{var}[h(\omega) \mid I]$ exist and are finite, and let $\bar{h}^{(M)} = M^{-1} \sum_{m=1}^{M} h(\omega^{(m)})$. Then there exists finite $\tau^2$ such that

$$M^{1/2}(\bar{h}^{(M)} - \bar{h}) \xrightarrow{d} N(0, \tau^2). \tag{4.44}$$


Proof: The same proof used for Theorem 4.5.2 shows that uniform ergodicity of $\{\theta^{(m)}\}$ implies uniform ergodicity of $\{\theta^{(m)}, \omega^{(m)}\}$. The result is then Theorem 5 of Tierney (1994). ∎


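Theorem 4.7.2 A Central Limit Theorem for MCMC Approximation of Bayes Actions In addition to the assumptions of Theorem 4.5.3, suppose further that $\{\theta^{(m)}\}$ is uniformly ergodic, and that for a suitably defined open neighborhood of $\hat{a}$, $N(\hat{a})$:


1. $\partial^2 L(a, \omega)/\partial a\, \partial a'$ exists and is a continuous function of $a$, for all $\omega \in \Omega$ and all $a \in N(\hat{a})$.

2. $M^{-1/2} \sum_{m=1}^{M} \partial L(a, \omega^{(m)})/\partial a \big|_{a = \hat{a}} \xrightarrow{d} N(0, B)$, where $B$ is a nonnegative definite matrix.

3. $H = E[\partial^2 L(a, \omega)/\partial a\, \partial a' \big|_{a = \hat{a}} \mid I]$ exists and is finite and nonsingular.

4. For any $\varepsilon > 0$, there exists $M_\varepsilon$ such that

$$P\left[ \sup_{a \in N(\hat{a})} \left| \partial^3 L(a, \omega)/\partial a_i\, \partial a_j\, \partial a_k \right| < M_\varepsilon \,\Big|\, I \right] \ge 1 - \varepsilon$$

for all $i, j, k = 1, \ldots, m$.

Then if $\hat{a}_M$ is any element of $A_M$ such that $\hat{a}_M \xrightarrow{p} \hat{a}$,

$$M^{1/2}(\hat{a}_M - \hat{a}) \xrightarrow{d} N(0, H^{-1} B H^{-1}). \tag{4.45}$$

Proof: The result is an application of Theorem 4.1.3 of Amemiya (1985). ∎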


To apply (4.44) or (4.45) in assessing numerical accuracy it is necessary to find a statistic $\hat{\tau}^{2(M)} \xrightarrow{a.s.} \tau^2$, as was done for independence and importance sampling. An analogous approximation for the matrix $B$ in condition 2 of Theorem 4.7.2 is also necessary. There are several approaches to this task. If we are willing to replicate the MCMC computations beginning with a randomly chosen $\theta^{(0)}$ each time, comparison of the results provides a basis for approximation of $\tau^2$; see Chan and Geyer (1994) and Chauveau and Diebolt (2000). The same ends can be accomplished by initiating several chains in the midst of the original chain, a process known as "splitting and regeneration"; see Mykland et al. (1995) and Robert (1995). The approach we take here uses a single chain, building on the following result, which is a staple of time series econometrics.


Theorem 4.7.3 Numerical Standard Errors for MCMC Suppose that $\{h(\omega^{(m)})\}$ in Theorem 4.7.1 is a stationary process with autocovariance function

$$c_j = \operatorname{cov}[h(\omega^{(m)}), h(\omega^{(m-j)})] \quad (j = 0, \pm 1, \pm 2, \ldots)$$

and spectral density function $S(\lambda) = \sum_{j=-\infty}^{\infty} c_j \cos(\lambda j)$. If $S(\lambda)$ is bounded uniformly both above and away from zero on $[0, \pi]$, then in (4.44), $\tau^2 = S(0) = \sum_{j=-\infty}^{\infty} c_j$. If

$$\hat{c}_j^{(M)} = M^{-1} \sum_{m=j+1}^{M} [h(\omega^{(m)}) - \bar{h}^{(M)}][h(\omega^{(m-j)}) - \bar{h}^{(M)}]$$

and $L(M)$ is an integer-valued function for which $\lim_{M \to \infty} L(M) = \infty$ while $\lim_{M \to \infty} L(M)^4/M = 0$, then

$$\hat{\tau}^{2(M)} = \hat{S}(0) = \hat{c}_0^{(M)} + 2 \sum_{s=1}^{L-1} [(L - s)/L]\, \hat{c}_s^{(M)} \xrightarrow{a.s.} S(0) = \tau^2. \tag{4.46}$$


Proof: See Newey and West (1987). ∎


The condition on the spectral density function in Theorem 4.7.3 guarantees that $c_j = \operatorname{cov}[h(\omega^{(m)}), h(\omega^{(m-j)})]$ decays rapidly enough with increasing $j$ that it is possible to obtain a consistent approximation of $\tau^2 = S(0) = \sum_{j=-\infty}^{\infty} c_j$. Given a modest strengthening of this condition, it is possible to investigate the question of whether the mean of $h(\omega^{(m)})$ is the same over different segments of the entire simulation $\{h(\omega^{(m)})\}$.
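
The estimator in (4.46) is straightforward to compute. The sketch below is one possible implementation; the function name, the truncation value $L = 40$, and the first-order autoregressive test series are assumptions made for the illustration. For that series, with coefficient 0.5 and unit innovation variance, $\tau^2 = 1/(1 - 0.5)^2 = 4$, which gives a rough benchmark for the estimate.

```python
import math
import random


def numerical_standard_error(h, L):
    """Return (mean, tau2_hat, nse) using the weighted autocovariance sum in (4.46)."""
    M = len(h)
    h_bar = sum(h) / M
    dev = [v - h_bar for v in h]

    def autocov(j):
        # Sample autocovariance at lag j, normalized by M as in the theorem.
        return sum(dev[m] * dev[m - j] for m in range(j, M)) / M

    tau2 = autocov(0) + 2.0 * sum((L - s) / L * autocov(s) for s in range(1, L))
    nse = math.sqrt(tau2 / M)
    return h_bar, tau2, nse


# Hypothetical serially correlated output: AR(1) with coefficient 0.5.
rng = random.Random(5)
series = []
value = 0.0
for _ in range(20000):
    value = 0.5 * value + rng.gauss(0.0, 1.0)
    series.append(value)

h_bar, tau2, nse = numerical_standard_error(series, 40)
```

Treating the same draws as i.i.d. would replace `tau2` with the sample variance, about 1.33 for this series, and so understate the numerical standard error by a factor of roughly $\sqrt{3}$.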


Theorem 4.7.4 Separated Partial Means Test for MCMC In addition to the assumptions of Theorems 4.7.1 and 4.7.3, suppose also that

$$|c_j| < c_0 \rho^j \quad (j = 1, 2, \ldots) \tag{4.47}$$

for some $\rho \in [0, 1)$. Let $p$ be a fixed positive integer. For each $M$ such that $M_p = M/2p$ is an integer, define the $p$ separated partial means:

$$\bar{h}^{(M)}_{j,p} = M_p^{-1} \sum_{m=1}^{M_p} h(\omega^{(m + (2j-1)M_p)}) \quad (j = 1, \ldots, p).$$

Let $\hat{\tau}^2_{j,p}$ be the estimate of $\tau^2$ computed for $\bar{h}^{(M)}_{j,p}$ as described in Theorem 4.7.3 $(j = 1, \ldots, p)$. Define the $(p - 1) \times 1$ vector $\bar{h}^{(M)}_p$ with $j$th element $\bar{h}^{(M)}_{j+1,p} - \bar{h}^{(M)}_{j,p}$ and the $(p - 1) \times (p - 1)$ tridiagonal matrix $\hat{V}^{(M)}_p$ in which $v_{jj} = M_p^{-1}(\hat{\tau}^2_{j,p} + \hat{\tau}^2_{j+1,p})$ and $v_{j,j-1} = v_{j-1,j} = -M_p^{-1} \hat{\tau}^2_{j,p}$. Then

$$\bar{h}^{(M)\prime}_p \left[ \hat{V}^{(M)}_p \right]^{-1} \bar{h}^{(M)}_p \xrightarrow{d} \chi^2(p - 1). \tag{4.48}$$

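Proof: Theorem 4.7.1 implies that any linear combination of the $\bar{h}^{(M)}_{j,p}$ has a limiting normal distribution, and consequently [see Rao (1965), Theorem 2c.5(iv)] $\bar{h}^{(M)}_p$ has a limiting multivariate normal distribution. It remains only to show that $\lim_{M_p \to \infty} M_p \operatorname{cov}[\bar{h}^{(M)}_{j,p}, \bar{h}^{(M)}_{k,p}] = 0$ for $j \ne k$. By virtue of (4.47), and letting $n$ index the later of the two segments,

$$M_p \left| \operatorname{cov}[\bar{h}^{(M)}_{j,p}, \bar{h}^{(M)}_{k,p}] \right| \le M_p^{-1} \sum_{m=1}^{M_p} \sum_{n=1}^{M_p} c_0\, \rho^{2|j-k|M_p + n - m} \le M_p^{-1} \sum_{m=1}^{M_p} \sum_{n=1}^{M_p} c_0\, \rho^{2M_p - m}$$

$$= c_0 \sum_{m=1}^{M_p} \rho^{2M_p - m} = c_0\, \rho^{M_p} \sum_{m=1}^{M_p} \rho^{M_p - m} \le \frac{c_0\, \rho^{M_p}}{1 - \rho},$$

which converges to zero as $M_p \to \infty$. ∎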

Application of the partial means test involves choosing $p$ as well as $M$. The theorem requires that $p$ be fixed. Thus, for example, if we choose $p = 4$, then if $M = 1{,}000$ the partial means are based on iterations 126-250, 376-500, 626-750, and 876-1000, whereas if $M = 40{,}000$, the partial means are based on 5001-10,000, 15,001-20,000, 25,001-30,000, and 35,001-40,000. The test will have power in two situations of particular concern in the application of MCMC. In the first, $h(\omega^{(m)})$ exhibits nonstationary or near-nonstationary behavior, such as that in a random walk or random walk with drift. In this situation RNE computed from the entire sequence $\{h(\omega^{(m)})\}$ will also be quite low. The problem may be that serial correlation is still strong with a separation of $M_p$ iterations, or, perhaps, that a limiting distribution does not exist (see Exercise 4.5.3).

The second situation in which the separated partial means test has power helps to address the convergence question and identify a number of initial iterations $B$ to discard before computing a final approximation $\bar{h}^{(M-B)}$. In this case the separated partial means test may fail because the first partial mean $\bar{h}^{(M)}_{1,p}$ reflects sensitivity to initial conditions and is therefore atypical of the rest of the sequence. This may be confirmed by examining a plot of the sequence $\{h(\omega^{(m)})\}$ or of a sequence of separated partial means. We may also conduct an obvious variant of the separated partial means test that compares a smaller number $M_1$ of early iterations with a larger number $M_2$ of later iterations, taking care that the two groups are separated by omitted iterations, typically at least $M_1$. The Bayesian analysis, computation, and communication (BACC) software, introduced in Section 5.1, handles the choice of $L$ in Theorem 4.7.3 and makes these kinds of comparisons easy.

The separated partial means test with p = 2, applied to the 400 iterations of
the Gibbs sampler illustrated in panels (a) and (b) of Figure 4.3, yields a value of
1.79; the corresponding p value from the \(\chi^2(1)\) distribution is .181. Since the test
compares the means for iterations 101-200 and 301-400, this is consistent with
omitting the first 100 iterations and proceeding with the remainder as being negligibly
influenced by the starting value and representative of the invariant distribution.
When the same test is applied in the situation illustrated in panels (c) and (d) of
Figure 4.3, the outcome is 29.29; the corresponding p value is \(6.24 \times 10^{-8}\). This
strong rejection is consistent with visual inspection of panels (c) and (d) and our
earlier conclusions about convergence problems when \(\rho = 0.99\). In this same situation,
using M = 10,000 iterations and a separated partial means test with p = 4,
the test statistic is 1.66, near the median of the \(\chi^2(3)\) distribution.
This result would support a decision to discard the first 1250 iterations,
and proceed with the remaining 8750 for further analysis.
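
The p values quoted in this paragraph can be checked without statistical software, because the chi-square survival function has a closed form for 1 and 3 degrees of freedom. The sketch below uses only the Python standard library; the function names are illustrative, and the inputs are the rounded test statistics printed in the text, so agreement is only to the printed precision.

```python
import math


def chi2_sf_df1(x):
    """P(X > x) for X ~ chi-square(1): equals 2 * P(Z > sqrt(x)) for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_df3(x):
    """P(X > x) for X ~ chi-square(3), via the closed form of the incomplete gamma function."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


if __name__ == "__main__":
    print("p = 2, panels (a)-(b):", chi2_sf_df1(1.79))
    print("p = 2, panels (c)-(d):", chi2_sf_df1(29.29))
    print("p = 4, M = 10,000:    ", chi2_sf_df3(1.66))
```

A separated partial means test with p groups is referred to a chi-square distribution with p - 1 degrees of freedom, which is why p = 2 and p = 4 map to these two functions.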

These procedures are all based on a single sequence, or run, of MCMC draws. 
Assessment of accuracy and convergence from single runs is inherently limited. 
Figure 4.2a illustrates an extreme case in which a single run would never detect the
fact that the chain is reducible. Practical problems can arise from near-reducibility
of the Markov chain. Consider the Gibbs sampler with blocks \(\theta_{(1)} = \theta_1\) and \(\theta_{(2)} = \theta_2\)
in the case of a multimodal bivariate posterior density like the one portrayed
in Figure 4.4. In that case there is substantial serial correlation and sensitivity to
the initial condition, since the probability that \(\theta^{(m)}\) will be near one of the two
major modes conditional on \(\theta^{(m-j)}\) being near the other is quite small, even if j is
quite large. If it is possible to conduct multiple runs of MCMC draws with random
initial draws \(\theta^{(0)}\), then such problems can be detected, but only if the draws \(\theta^{(0)}\)
are sufficiently dispersed that they have significant probability of being near each
of the two major modes in situations like the one illustrated in Figure 4.4. Theorem
4.7.4 may be applied to independent runs from initial conditions, using the means
of the different runs (perhaps after discarding some initial draws) in place of the
separated partial means. In this application the theorem exploits the fact that \(\bar{h}_M\)
together with its NSE provides a prediction of the approximation based on an
independent realization of the same Markov chain.

Figure 4.4. A multimodal probability density function ill-suited to Gibbs sampling.


CHAPTER 5 


Linear Models 


Chapter 2 developed the idea of a complete model, or a complete sequence of models,
as an abstract and flexible framework for Bayesian inference. The specification
of a complete sequence of models \(A = \{A_1, \ldots, A_J\}\) is

\[ p(A_j \mid A), \tag{5.1} \]
\[ p(\theta_{A_j} \mid A_j), \tag{5.2} \]
\[ p(y \mid \theta_{A_j}, A_j), \tag{5.3} \]
\[ p(\omega \mid y, \theta_{A_j}, A_j), \tag{5.4} \]

where j = 1, ..., J. In (5.1)-(5.4) \(A_j\) denotes the model j, \(\theta_{A_j}\) the \(k_{A_j} \times 1\) vector
of unobservables in model j, y the vector of observables common to all J models,
and \(\omega\) the common vector of interest. In many applications the investigator’s final
or intermediate objective is to determine \(p(\omega \mid y^o, A)\), where \(y^o\) is the observed
value of y (the data). This and the next two chapters present some specifications
of functional forms (5.3) for the conditional distribution of observables.

This chapter concentrates on practical issues surrounding the use of the linear 
model and some important extensions of that model. It begins in Section 5.1 by 
introducing mathematical applications software incorporating the posterior simula- 
tors described in the previous chapter, and illustrates its use in the context of the 
normal linear model first introduced in Example 2.1.2. This illustration also shows 
how the output from a posterior simulator can be used in a decisionmaking context. 
The chapter continues with the seemingly unrelated regressions model, which has 
played a central role in econometrics and can be regarded as a multivariate gener- 
alization of the normal linear model. Section 5.3 takes up the common and related 
problems of choosing a subset of covariates from a large set of potential covariates, 
and enforcing inequality constraints on the coefficients of a linear model. Finally, 
Section 5.4 develops two ways in which the normal linear model can be extended
to specify regressions that are nonlinear in the covariates, while remaining a special
case of the model first introduced in Example 2.1.2. 


5.1 BACC AND THE NORMAL LINEAR REGRESSION MODEL 


The Bayesian analysis, computation, and communication (BACC) software pro- 
vides convenient tools for using the models described in this chapter and many of 
the posterior simulation methods described in Chapter 4. An important feature of 
the BACC software is that it implements its tools as extensions of the mathemat- 
ical applications Matlab, Gauss, Splus, and R, running under several variants of 
Windows, Unix, and Linux operating systems. It provides a seamless integration 
of special-purpose BACC commands with built-in general-purpose commands for 
computation, graphics, and program flow control. The user has available a number 
of models, which describe the joint distribution of observables and unobservables 
(5.2)—(5.3). The user creates model instances by selecting one of the models and 
supplying values for its known quantities—the data, and the fixed parameters of 
the prior distribution. BACC incorporates most of the models discussed in this and 
the next two chapters. It provides prior and posterior simulation using the methods 
described in Chapter 4, as well as simulation facilities for model comparison, spec- 
ification analysis, communication, and robustness analysis described in Chapter 8. 

The BACC software is available free of charge through the companion Website 
described in the preface. This site also provides complete instructions for installa- 
tion, documentation, and tutorials. The online help features of Matlab, Gauss, Splus, 
and R include the BACC extensions once they are installed. The online appendix 
provides code and output for all the examples in this book that use BACC software. 
The code is heavily annotated to introduce the reader to the use of mathematical 
applications software for Bayesian analysis generally, and to BACC in particular. 
After installing BACC and running the test programs described in the companion 
Website, it is instructive to execute the code for the examples in this section. By 
editing the code for the examples, the reader can rapidly gain familiarity and con- 
fidence with the software. Editing the code for the examples is also the easiest way 
to approach many of the exercises that require computing. 


Example 5.1.1 The Impact of Class Size on Test Scores (The online appendix 
contains data, annotated code, and output for this example.) An important decision 
made by school boards and school district superintendents is the ratio of students 
to teachers. A lower ratio is thought to improve education, including test scores on 
standardized examinations that are increasingly used to evaluate school districts. 
A lower ratio certainly requires spending more money on teachers’ salaries and 
supporting infrastructure such as classrooms. Some states have systematically col- 
lected district data on test scores, student : teacher ratios, and other factors that 
may affect educational outcomes including test scores.

The Massachusetts Comprehensive Assessment System (MCAS) test is administered
to all fourth-graders in Massachusetts public schools each spring. The test
score data used in this example is the average total score from the 1998 exam- 
ination in each of the 220 Massachusetts elementary school districts. These data 
were obtained from the Massachusetts Department of Education, as were data on 
the student : teacher ratio (str), the percentage of students still learning English 
(el), and the percentage of students receiving a subsidized lunch in each district 
(lunch). In addition data on the average district per capita income were obtained 
from the 1990 U.S. census and are coded as the logarithm of income in thousands 
of dollars (income). 

This example illustrates the use of mathematical applications software and 
BACC in creating a complete model to support decisionmaking regarding stu- 
dent : teacher ratios. To get the most out of the example, the reader should follow 
the code and execute the commands while reading. We begin by finding some
summary statistics to gain familiarity with the data, as follows:


          Mean      Median    SD^a      Minimum   Maximum
str       17.34     17.1      2.278     11.4      27.0
el        1.13      0         2.90      0         24.49
lunch     15.32     10.55     15.06     0.40      76.20
income    2.89      2.84      0.27      2.27      3.85
score     709.83    711       15.13     658       740

^a Standard deviation.


The sample distributions of the percentage of students still learning English 
and the percentage of students receiving a subsidized lunch are strongly positively 
skewed. The student : teacher ratio is rounded to the nearest one-tenth in the 
data, ranges from 11.4 to 27, and has a somewhat positively skewed distribution. 
After transformation to logarithms average district income has a nearly symmetric 
distribution. Average district income is almost 5 times higher in some districts than 
in others. The distribution of test scores is close to normal with a mean of about 
710 and standard deviation of about 15. 

This example uses a normal linear regression model with str, el, lunch, and 
income as covariates, and score as the outcome. (Examples 5.4.1—5.4.4 examine 
and elaborate on this specification.) The prior distribution about to be described 
reflects the assumption that the model approximates a relationship between average 
test score and the covariates that applies at the observed values of the covariates, 
and at those points permits a substantial range of behavior. For each observed
combination of covariates \(x_t\), the scalar \(E(y_t \mid x_t, A) = \beta'x_t\) is a random variable
a priori. The prior takes \(E(\beta'x_t \mid A) = \mu = E(y_t \mid A) = 710\), which implies that
the prior mean of the intercept \(\beta_1\) is \(\mu\), while the prior mean of each covariate
coefficient is 0. The prior distribution of the linear combinations \(\beta'x_t\) is multivariate
normal, with \(\mathrm{cov}(\beta'x_t, \beta'x_s) = 0\) for all combinations \(t \neq s\). If sample size T were
the same as the number of covariates k, then the variance associated with each
\(\beta'x_t\) would be \(\sigma^2 = \mathrm{var}(y_t \mid A)\). In order to retain the same notional sample size
regardless of T, \(\mathrm{var}(\beta'x_t) = T\sigma^2/k\). Thus in general the prior distribution of \(\beta\) is

\[ \beta \mid (X, A) \sim N(\underline{\beta}, \underline{H}^{-1}), \]

with

\[ \underline{H} = (k/T\sigma^2)X'X, \qquad \underline{\beta} = (X'X)^{-1}X'\iota_T\,\mu = (\mu, 0, \ldots, 0)'. \]


(The idea of using the covariate matrix to construct a prior distribution for regression
coefficients was introduced in econometrics by Zellner (1986b).) The prior
distribution of the precision h derives from choosing the hyperparameter \(\sigma^2 = \mathrm{var}(y_t \mid A) = 15^2\),
and considering the population multiple correlation coefficient
\(R^2 = 1 - (\sigma^2 h)^{-1}\). If \((3/\sigma^2)h \mid A \sim \chi^2(1)\), then \(1 - [\sigma^2 E(h \mid A)]^{-1}\) is about two-thirds,
and the prior probability that \(R^2\) exceeds .90 is approximately .25.

Recall from Example 4.3.1 that the Gibbs sampling algorithm for the normal 
linear model, which is used by BACC, has excellent convergence properties and 
almost no serial correlation. Hence we use only 100 burn-in iterations, followed 
by 100,000 draws from the posterior distribution. (This should all take only a few 
seconds on a desktop or laptop computer.) BACC can provide detailed information 
about the approximation of any function of interest, illustrated here for the coefficient
of str, which is central in subsequent analysis (mean, -0.6843; standard
deviation, 0.2636):


Accuracy of Approximation

                                            Numerical         Relative
Method                                      Standard Error    Numerical Efficiency
Assuming no serial correlation              8.3364 x 10^-4    1.0000
Autocovariance function tapered to 1.00%    7.6772 x 10^-4    1.1791
Autocovariance function tapered to 2.00%    7.1833 x 10^-4    1.3468
Autocovariance function tapered to 3.00%    7.1066 x 10^-4    1.3760
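
The entries of this table are linked by simple arithmetic, which the following Python sketch checks. It assumes the numerical standard errors are on the scale 10^-4 (the reading implied by the posterior standard deviation 0.2636 and the 100,000 draws) and uses the definition of relative numerical efficiency as the squared ratio of the no-serial-correlation NSE to the NSE actually estimated. The variable and function names are illustrative.

```python
import math

posterior_sd = 0.2636   # posterior standard deviation of the str coefficient
n_draws = 100_000       # number of retained posterior draws

# NSE if the draws were serially uncorrelated: sd / sqrt(M)
nse_iid = posterior_sd / math.sqrt(n_draws)

# NSE values as read from the table (scale 10^-4 is an assumption of this sketch)
nse_table_iid = 8.3364e-4
nse_tapered = {"1.00%": 7.6772e-4, "2.00%": 7.1833e-4, "3.00%": 7.1066e-4}


def rne(nse_reference, nse):
    """Relative numerical efficiency: variance ratio of the iid NSE to the estimated NSE."""
    return (nse_reference / nse) ** 2


if __name__ == "__main__":
    print("sd / sqrt(M):", nse_iid)
    for taper, nse in nse_tapered.items():
        print("taper", taper, "RNE:", round(rne(nse_table_iid, nse), 4))
```

An RNE slightly above 1 means the tapered estimates find mildly negative autocovariances in the draws, so the simulator is behaving about as well as independent sampling would.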


There are four alternative approximations of the numerical standard error. The first
assumes that the function of interest (here, the coefficient of str) is serially uncorrelated.
This is generally not the case in MCMC algorithms. The other three approximations
use the methods described in Section 4.7, specifically in Theorem 4.7.3,
with L = 0.01M, L = 0.02M, and L = 0.03M. All of these methods indicate that
relative numerical efficiency is close to 1. This reflects the fact that the Gibbs
sampling algorithm of Example 4.3.1 for the normal linear model displays very little
serial correlation. This is confirmed by the separated partial means test developed
in Theorem 4.7.4, applied here to the str coefficient.


Method                                      Test Statistic    p Value
Assuming no serial correlation                                .896
Autocovariance function tapered to 1.00%                      .896
Autocovariance function tapered to 2.00%                      .894
Autocovariance function tapered to 3.00%                      .895


We can then compare the prior and posterior moments of the unobservables, as 
follows: 


                             Prior                Posterior
Unobservable             Mean    SD           Mean       SD
intercept coefficient    710     120.23       682.61     10.551
str coefficient          0       3.0163       -0.6843    0.2636
el coefficient           0       3.1937       -0.4086    0.2788
lunch coefficient        0       0.7746       -0.5173    0.0678
income coefficient       0       33.818       16.418     2.964
h^(-1/2)                 -       -            8.716      0.422


Note that the ratio of prior to posterior standard deviation is nearly the same for each
coefficient. This is due to the fact that \(\underline{H} \propto X'X\), and the ratio is approximately
\([1 + hT\sigma^2/k]^{1/2}\).
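
The common ratio can be checked from the numbers already reported. The sketch below reads the ratio formula as [1 + h T sigma^2 / k]^(1/2), which follows from the prior precision (k / T sigma^2) X'X combining with the data precision h X'X. As a rough plug-in it replaces h by the inverse square of the posterior mean of h^(-1/2), 8.716, and takes T = 220, k = 5 (intercept plus four covariates), and sigma^2 = 15^2; these plug-in choices are assumptions of the sketch, not values stated in this form in the text.

```python
import math

T = 220                 # number of school districts
k = 5                   # intercept plus four covariates (assumed count)
sigma2 = 15.0 ** 2      # prior hyperparameter sigma^2
h_plugin = 1.0 / 8.716 ** 2   # crude plug-in for the precision h

# Prior precision is (k / (T sigma^2)) X'X; conditional on h the data add h X'X.
ratio = math.sqrt(1.0 + h_plugin * T * sigma2 / k)

# Ratios of prior SD to posterior SD taken from the table of moments.
table_ratios = [
    120.23 / 10.551,   # intercept
    3.0163 / 0.2636,   # str
    3.1937 / 0.2788,   # el
    0.7746 / 0.0678,   # lunch
    33.818 / 2.964,    # income
]

if __name__ == "__main__":
    print("formula:", ratio)
    print("table:  ", [round(r, 3) for r in table_ratios])
```

The small differences across coefficients reflect rounding in the table and the fact that the posterior averages over h rather than conditioning on a single value.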

The marginal likelihood of the normal linear model can be calculated in two 
ways. A generic simulation method described in Section 8.2.4 provides the approximation
-805.8185 of the log marginal likelihood, with a numerical standard error
of .0030. An essentially exact calculation, using (2.80) and one-dimensional quadrature,
is -805.8189. This value will be important in comparing this model with some
variants introduced in Section 5.4. 

The sensitivity of the posterior distribution to the prior distribution can be studied 
in a number of ways, as discussed in Section 3.3. One of the simplest is to vary the 
hyperparameters of the prior distribution. In a “weak prior” variant of the model, 
H is reduced by a factor of 5, and in a “strong prior” variant it is increased by a 
factor of 5. The direction of the impact on the posterior means of the coefficients 
is predictable because \(\underline{H} \propto X'X\).

                                Posterior Means with Prior
               Least                                             Prior
Coefficient    Squares      “Weak”      Original    “Strong”     Mean
intercept      682.4316     682.4691    682.6081    683.4652     710
str            -0.6892      -0.6885     -0.6843     -0.6633      0
el             -0.4107      -0.4104     -0.4086     -0.3962      0
lunch          -0.5215      -0.5205     -0.5173     -0.5022      0
income         16.5294      16.5076     16.4178     15.9102      0


The effect on the marginal likelihood cannot be anticipated. It turns out that the log
marginal likelihood with the “weak” prior is -808.4703 and with the “strong” prior
is -808.4148. The Bayes factor in favor of the original specification, relative to
either alternative, is about 13.9. Recall the discussion at the end of Section 3.2 that
as \(\underline{H} \to 0\), marginal likelihood is driven to zero. At the other extreme a dogmatic
prior \(\beta = \underline{\beta}\) would produce a very poor fit and consequently also a very small
marginal likelihood.
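
A Bayes factor is the ratio of marginal likelihoods, so it is obtained by exponentiating the difference of the log marginal likelihoods. The short Python sketch below does this for the three values reported above; the dictionary keys are illustrative labels. The two results it produces differ slightly from each other and bracket the rounded figure quoted in the text.

```python
import math

# Log marginal likelihoods reported in the text.
log_ml = {"original": -805.8185, "weak": -808.4703, "strong": -808.4148}

# Bayes factor in favor of the original prior relative to each alternative.
bayes_factor = {
    name: math.exp(log_ml["original"] - value)
    for name, value in log_ml.items()
    if name != "original"
}

if __name__ == "__main__":
    for name, bf in bayes_factor.items():
        print("original vs", name, ":", round(bf, 2))
```

Working on the log scale until the final step avoids underflow, since the marginal likelihoods themselves are on the order of exp(-806).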


Example 5.1.1 incorporates two of the three elements of a complete model—the 
prior distribution and the observables distribution. These elements provide the pos- 
terior distribution, and Example 5.1.1 shows how posterior simulation methods 
and mathematical applications software can provide a useful representation of the 
posterior. This representation is exactly what we need to address the decision- 
making problems that motivate Bayesian analysis in the first place, as discussed in 
Chapter 1. The next example carries the previous example forward to two variants 
of a specific decisionmaking problem.


Example 5.1.2 Deciding on Class Size (The online appendix contains data, 
annotated code, and output for this example.) The determination of class size in pub- 
lic schools is a political and fiscal decision whose details vary from state to state and 
district to district. Regardless of the details, the decision ultimately made balances 
the fact that, given the number of students in the district, a lower student : teacher 
ratio is more costly, against the perception that a lower student : teacher ratio 
also increases the quality of education. The results in Example 5.1.1 support this 
perception, to the extent that we are willing to identify higher test scores with 
increased quality of education. The effect is not large: \(E(\beta_2 \mid y^o, X, A) = -0.6827\)
implies that the difference between the highest student : teacher ratio of 27 and
the lowest student : teacher ratio of 11.4 in the 220 Massachusetts elementary
school districts accounts for a difference in test scores of about 10.6, considerably
less than the sample standard deviation in test scores. Moreover, the distribution
\(\beta_2 \mid (y^o, X, A)\) implies substantial uncertainty about this effect. In this example we
consider two loss functions that, together with the posterior distribution developed 
in Example 5.1.1, enable us to derive the corresponding optimal student : teacher 
ratio, that is, the Bayes action.

The first loss function assigns a dollar value to the test score of each student.
(This could represent either a social consensus manifest in local school boards, or 
it could be an explicit subsidy provided from the state of Massachusetts to local 
school districts.) Let T be the number of teachers in the school district and S the 
number of students. Suppose that the cost of each teacher is c, including not only 
the teacher’s salary and benefits but also the annual cost of the additional facilities 
and support required for each teacher. Let d be the value placed by the school 
district on each test point for each student in each year. Then the loss function is
\(cT - dS\omega\), where \(\omega\) is the average test score in the district. (The notation reflects
the fact that average test score is the vector of interest.) Dividing by S, we obtain
the equivalent loss function

\[ L(a, \omega) = \frac{c}{a} - d\omega, \tag{5.5} \]

where a is the student : teacher ratio (the Bayes action, in this decision problem).
We shall refer to (5.5) as the “average score” loss function. In the model
of Example 5.1.1, \(\omega = \gamma + \beta a + \varepsilon\), where \(\gamma\) denotes the effect of all covariates
other than the student : teacher ratio, and is specific to each school district. The
expected loss is

\[ E[L(a, \omega) \mid \text{data}] = \frac{c}{a} - d(\bar{\gamma} + \bar{\beta} a), \]

where the overbars denote posterior means. Simple calculus shows that the Bayes
action is \(a = (-c/d\bar{\beta})^{1/2}\). For example, if c = $100,000, d = $250, and \(\bar{\beta} = -1\),
then a = 20. The Bayes action depends only on the relative values of c and d.
This computation is simple, and it is easy to use alternative assumptions about 
c/d and alternative prior distributions. Continuing where we left off in Example 
5.1.1, we easily find the following Bayes actions. 


              Bayes Action a for c/d =
Prior         150        200        250
“Weak”        14.7604    17.0438    19.0556
Original      14.8056    17.0961    19.1140
“Strong”      15.0379    17.3642    19.4138
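
The entries of this table follow from the closed-form Bayes action for the average score loss function, the square root of -(c/d) divided by the posterior mean of the str coefficient. The Python sketch below applies that formula to the posterior means reported for the three priors in Example 5.1.1; the function name `bayes_action` and the dictionary layout are illustrative, and small discrepancies in the last digit arise because the inputs are rounded.

```python
import math


def bayes_action(c_over_d, beta_bar):
    """Minimizer of c/a - d*(gamma_bar + beta_bar*a) over a > 0; requires beta_bar < 0."""
    if beta_bar >= 0:
        raise ValueError("the expected loss has no interior minimum unless beta_bar < 0")
    return math.sqrt(-c_over_d / beta_bar)


# Posterior means of the str coefficient under each prior (from Example 5.1.1).
posterior_mean_str = {"weak": -0.6885, "original": -0.6843, "strong": -0.6633}

if __name__ == "__main__":
    for prior, beta_bar in posterior_mean_str.items():
        actions = [round(bayes_action(r, beta_bar), 4) for r in (150.0, 200.0, 250.0)]
        print(prior, actions)
```

Because only the ratio c/d enters, doubling both the teacher cost and the value of a test point leaves the optimal ratio unchanged.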


Changing the prior precision by a factor of 5 has a small effect on the Bayes 
action. On the other hand, changes in c and d are important. If the annual cost of a 
teacher and supporting staff and facilities is $100,000, then the difference between 
valuing each test score point per student at $500 as opposed to $400 accounts for 
a decrease in the optimal student : teacher ratio from 19 to 17. (Recall that 17 is 
the mean student : teacher ratio for all districts.) Similarly, if c = $100,000 and 
d = $500, a 10% increase in teacher costs would result in almost one more student 
per class, on average.

An alternative approach to examining sensitivity to the prior distribution is to
vary the prior over the density ratio class, as discussed in Section 3.3. (Section 8.5
presents the corresponding computations.) For the stated prior density \(p(\theta_A \mid A)\),
BACC computes upper and lower bounds on posterior moments over all prior densities
with a kernel \(k(\theta_A \mid A)\) satisfying \(r^{-1} p(\theta_A \mid A) \le k(\theta_A \mid A) \le r\,p(\theta_A \mid A)\),
for specified r > 1. In the case of the average score loss function (5.5) the Bayes
action a varies inversely with \(\beta_2\). Therefore bounds on a may be derived from
those on \(\beta_2\). For r = 2, 4, 8, we find


Density Ratio Factor r    2          4          8
Lower bound on a          14.0773    15.5338    16.6720
Upper bound on a          15.6607    19.2508    23.0857


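In the alternative loss function the school district receives a cash transfer of d
dollars per student if the average test score in the district exceeds a target t. Then

\[ L(a, \omega) = \frac{c}{a} + d\,I_{(-\infty, t)}(\omega). \tag{5.6} \]

We shall refer to (5.6) as the “threshold average score” loss function. Conditional
on the unobservables \(\beta\) and h, we obtain

\[ E[L(a, \omega) \mid \gamma, \beta, h] = \frac{c}{a} + dP(\gamma + \beta a + \varepsilon < t)
 = \frac{c}{a} + dP(\varepsilon < t - \gamma - \beta a) = \frac{c}{a} + d\Phi[h^{1/2}(t - \gamma - \beta a)], \]

where \(\Phi(\cdot)\) is the cdf of the standard normal distribution. Then the applicable risk
function is

\[ E[L(a, \omega) \mid y^o, X, A] = \frac{c}{a} + d\int_{\Theta_A} \Phi[h^{1/2}(t - \gamma - \beta a)]\, p(\gamma, \beta, h \mid y^o, X, A)\, d\theta_A, \]

where \(\theta_A = (\gamma, \beta, h)'\). The first-order condition is

\[ -\frac{c}{a^2} - d\int_{\Theta_A} \phi[h^{1/2}(t - \gamma - \beta a)]\, h^{1/2}\beta\, p(\gamma, \beta, h \mid y^o, X, A)\, d\theta_A \]
\[ = -\frac{c}{a^2} - dE\{\phi[h^{1/2}(t - \gamma - \beta a)]\, h^{1/2}\beta \mid y^o, X, A\} = 0. \tag{5.7} \]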


Note that covariates other than the student : teacher ratio affect the Bayes action, 
given the threshold average score loss function, whereas they did not, given the 
average score loss function.

Given the output of the posterior simulator, an approximation of (5.7) is

\[ -\frac{c}{a^2} - dM^{-1}\sum_{m=1}^{M} \phi[h^{(m)1/2}(t - \gamma^{(m)} - \beta^{(m)}a)]\, h^{(m)1/2}\beta^{(m)} = 0. \tag{5.8} \]

The root of (5.8) can be found by evaluating the left side of the expression over a
suitable range of potential actions and then interpolating. (Theorems 4.5.3 and 4.7.2
provide the formal justification for this procedure.) From (5.8) the result depends
on the parameters of the loss function only through the ratio c/d. Suppose the
target test score is t = 710. Then Bayes actions for a hypothetical school district
with sample median values of el, lunch, and income, which enter through \(\gamma\), are
as follows:


c/d               30         40         50
Bayes action a    16.3734    17.9941    19.6036
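
The grid-and-interpolate procedure described above can be sketched in a few lines of Python. The sketch assumes the first-order condition has the form -(c/d)/a^2 minus the simulation average of phi[h^(1/2)(t - gamma - beta a)] h^(1/2) beta, and it returns the first grid crossing at which that expression moves from negative to nonnegative, which corresponds to a local minimum of expected loss. The draws of (gamma, beta, h) are synthetic and purely hypothetical, as are the value c/d = 4 and the names `foc` and `bayes_action_threshold`; they are not the posterior draws or results of Example 5.1.2.

```python
import math
import random


def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def foc(a, c_over_d, t, draws):
    """Simulation approximation of the first-order condition, divided by d."""
    total = sum(phi(math.sqrt(h) * (t - g - b * a)) * math.sqrt(h) * b for g, b, h in draws)
    return -c_over_d / a ** 2 - total / len(draws)


def bayes_action_threshold(c_over_d, t, draws, grid):
    """Evaluate the condition on a grid and linearly interpolate the first upward crossing."""
    values = [foc(a, c_over_d, t, draws) for a in grid]
    for i in range(len(grid) - 1):
        f0, f1 = values[i], values[i + 1]
        if f0 < 0.0 <= f1:
            return grid[i] + (grid[i + 1] - grid[i]) * (-f0) / (f1 - f0)
    return None  # no interior minimum on this grid


# Hypothetical stand-in for posterior draws of (gamma, beta, h).
rng = random.Random(0)
draws = [
    (rng.gauss(722.0, 1.0), rng.gauss(-0.68, 0.05), 1.0 / rng.gauss(8.7, 0.4) ** 2)
    for _ in range(2000)
]
grid = [5.0 + 0.05 * i for i in range(251)]   # candidate actions from 5.0 to 17.5
a_hat = bayes_action_threshold(4.0, 710.0, draws, grid)

if __name__ == "__main__":
    print("approximate Bayes action:", a_hat)
```

Restricting attention to upward crossings matters because the expected loss under (5.6) is not convex in a, so the first-order condition can also be satisfied at a local maximum.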


Exercises 5.1.2 and 5.1.4 explore the sensitivity and interpretation of these results. 


Exercise 5.1.1 Prior Sensitivity in Example 5.1.1 This exercise explores in 
greater detail some of the findings about prior sensitivity in Example 5.1.1. 


(a) By experimenting with appropriate variations of the coefficient vector 
prior precision matrix H, demonstrate that marginal likelihood is mono- 
tone decreasing for values of H larger than that used in the “strong prior” 
as well as for values of H smaller than that used in the “weak prior.” (It is 
easiest to answer these questions by modifying the code for Example 5.1.1.) 

(b) Why does the Bayes action a for the average score loss function (5.5)
increase as the prior precision of \(\beta\) increases? (Remember that \(\underline{\beta} =
(\mu, 0, \ldots, 0)'\) in all of these prior distributions.)

(c) In the examples given for the average score loss function (5.5), why did the
variation of the prior distribution within the density ratio class have a much
more substantial impact on the Bayes action than did the variation in \(\underline{H}\)?


Exercise 5.1.2 Prior Sensitivity in Example 5.1.2 Explore the sensitivity of
the Bayes action a for the loss function (5.6) to variations in
the prior distribution, in the same way as was done for the loss function (5.5) in
Example 5.1.2.


Exercise 5.1.3 Predictive Distributions Consider two hypothetical school dis- 
tricts, one with str = 15 and the other with str = 20. The values of all other 
covariates are the sample medians. Let \(\omega\) denote the difference between the average
test score in the school district with str = 20 and the district with str = 15.

(a) Find the predictive mean and standard deviation of \(\omega\).

(b) Fix the values of the coefficients \(\beta\) at the least-squares estimates b and
fix the value of \(\sigma^2 = h^{-1}\) at \(\hat{\sigma}^2 = (y^o - Xb)'(y^o - Xb)/(T - k)\). Find the
corresponding mean and standard deviation of \(\omega\). This computation ignores
the uncertainty about the unobservables in the model. Compare the results
with those in part (a).

(c) Find \(P(\omega > 0 \mid y^o, X, A)\).


Exercise 5.1.4 Interpreting Observed Student : Teacher Ratios in Example 
5.1.2 There is substantial variation in the average student : teacher ratio across 
school districts. This exercise investigates the extent to which these ratios can be 
interpreted as Bayes actions in the context of Example 5.1.2. 


(a) In the case of the average score loss function (5.5), the Bayes action does 
not depend on the values of the covariates (el, lunch, income). If str is a 
Bayes action, then observed variation in str can be due only to the variation 
in c/d. Find the value of c/d corresponding to the student : teacher ratio 
in each district and display the result as a histogram. 

(b) In the case of the threshold average score loss function (5.6) the Bayes 
action depends on el, lunch, and income. Determine the Bayes action
corresponding to c/d = 40 in each school district. 

(c) Are the Bayes actions in part (b) positively correlated with the observed 
student : teacher ratios across school districts? 

(d) Is there a value of c/d in each school district that fully accounts for the 
observed values of str in the case of the threshold average score loss 
function (5.6)? If so, is there more, or less, variation in this ratio than was 
the case in part (a) of this exercise? 


Exercise 5.1.5 A Second Decision Problem in Example 5.1.2 The data spread- 
sheet provided for Examples 5.1.1 and 5.1.2 also includes, in column 16, the 
average teacher’s salary in each school district. In this example, assume that the 
cost of each teacher is twice the teacher’s salary. 


(a) Assuming that str is the Bayes action, find the value of d in each school 
district, for the average score loss function (5.5). 

(b) Are the values of d found in (a) systematically related to the covariates el, 
lunch, and income? 


5.2 SEEMINGLY UNRELATED REGRESSIONS MODELS 


The seemingly unrelated regressions (SUR) model developed in Zellner (1962) is 
perhaps the most widely used econometric model after linear regression. The reason 
is that it provides a simple and useful representation of systems of demand equations 
that arise in neoclassical static theories of producer and consumer behavior.

Two widely applied models in the theory of production provide examples
of the SUR model. If the m x 1 vector w denotes factor prices facing a producer
of output y with cost function c(w, y), then from Shephard’s lemma
(Varian 1992, Section 5.4) the corresponding m x 1 vector of factor demands is
\(x(w, y) = \partial c(w, y)/\partial w\). Given a functional form for c(w, y), factor demands can
be derived explicitly.

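The generalized Leontief (or Diewert) cost function (Varian 1992, Section
12.10) is

\[ c(w, y) = y\sum_{i=1}^{m}\sum_{j=1}^{m} b_{ij} w_i^{1/2} w_j^{1/2} + \sum_{i=1}^{m} w_i\varepsilon_i, \]

where \(b_{ij} = b_{ji}\) and, defining \(\varepsilon = (\varepsilon_1, \ldots, \varepsilon_m)'\), \(\varepsilon \mid (w, y) \sim N(0, \Sigma)\). Then

\[ x_i(w, y) = y\sum_{j=1}^{m} b_{ij}(w_j/w_i)^{1/2} + \varepsilon_i \quad (i = 1, \ldots, m). \]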

Note that there are m equations, each of which individually satisfies the observables 
specification in the normal linear regression model, but that most parameters appear 
in two equations and the disturbance terms in the different equations are allowed 
to be correlated. 
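
Shephard's lemma for the generalized Leontief form can be verified numerically. The Python sketch below sets the disturbances to zero, evaluates the cost function y times the double sum of b_ij (w_i w_j)^(1/2), and compares the closed-form factor demands with central finite differences of the cost function. The prices, output level, and symmetric coefficient matrix `b` are arbitrary illustrative values, and the function names are hypothetical.

```python
import math


def leontief_cost(w, y, b):
    """Generalized Leontief cost with disturbances set to zero."""
    m = len(w)
    return y * sum(b[i][j] * math.sqrt(w[i] * w[j]) for i in range(m) for j in range(m))


def leontief_demands(w, y, b):
    """Factor demands x_i = y * sum_j b_ij * sqrt(w_j / w_i), valid for symmetric b."""
    m = len(w)
    return [y * sum(b[i][j] * math.sqrt(w[j] / w[i]) for j in range(m)) for i in range(m)]


# Illustrative inputs (not from the text): two factors, symmetric b.
w = [2.0, 3.0]
y = 10.0
b = [[0.5, 0.2], [0.2, 0.8]]

x = leontief_demands(w, y, b)

# Central finite differences of the cost function with respect to each price.
eps = 1e-6
grad = []
for i in range(len(w)):
    up = list(w)
    dn = list(w)
    up[i] += eps
    dn[i] -= eps
    grad.append((leontief_cost(up, y, b) - leontief_cost(dn, y, b)) / (2.0 * eps))

if __name__ == "__main__":
    print("closed-form demands:", x)
    print("numerical gradient: ", grad)
    print("sum of w_i x_i:", sum(wi * xi for wi, xi in zip(w, x)), "cost:", leontief_cost(w, y, b))
```

The final line also illustrates that cost equals the sum of price times quantity across factors, as it must for a cost function that is homogeneous of degree one in prices. Each off-diagonal b_ij appears in both the ith and jth demand equations, which is the cross-equation restriction the SUR framework is designed to handle.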

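The translog cost function (Varian 1992, Section 8.4) is

\[ \log c(w, y) = a_0 + \sum_{i=1}^{m} a_i\log w_i + \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} b_{ij}\log w_i\log w_j
 + \log y + \sum_{i=1}^{m}\log(w_i)\varepsilon_i, \]

in which \(\sum_{i=1}^{m} a_i = 1\), \(\varepsilon \mid (w, y) \sim N(0, \Sigma)\), and for all i = 1, ..., m:

\[ b_{ij} = b_{ji} \quad (j = 1, \ldots, m), \qquad \sum_{j=1}^{m} b_{ij} = 0. \]

Since \(\partial\log c(w, y)/\partial\log w_i = [\partial c(w, y)/\partial w_i]\cdot(w_i/c)\) (i = 1, ..., m), the cost
share of the ith factor is

\[ \frac{w_i x_i(w, y)}{c(w, y)} = a_i + \sum_{j=1}^{m} b_{ij}\log w_j + \varepsilon_i \quad (i = 1, \ldots, m). \]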


Once again there are m equations, each of which individually satisfies the observ- 
ables specification of the normal linear regression model. Most parameters appear 
in two equations, and the disturbance terms in the different equations are allowed 
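to be correlated.

In the seemingly unrelated regressions model, A, the relations of interest are

\[ \underset{T\times 1}{y_j} = \underset{T\times k}{Z_j}\beta + \varepsilon_j \quad (j = 1, \ldots, m). \]

Let

\[ \underset{Tm\times 1}{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}, \qquad
   \underset{Tm\times k}{Z} = \begin{pmatrix} Z_1 \\ \vdots \\ Z_m \end{pmatrix}, \qquad
   \underset{Tm\times 1}{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_m \end{pmatrix}. \]

The disturbance vector \(\varepsilon\) is normally distributed. Components of \(\varepsilon\) are uncorrelated
across the observations t = 1, ..., T, but may be correlated across the m equations;
thus, \(\mathrm{cov}(\varepsilon_i, \varepsilon_j) = \sigma_{ij}I_T\) and the m x m matrix \(\Sigma = [\sigma_{ij}]\) is positive definite. Then
\(\varepsilon \mid (\beta, \Sigma, Z, A) \sim N(0, \Sigma \otimes I_T)\).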

A special case of the SUR model, sometimes the focus of textbook discussions 
[see, e.g., Greene (2003), Section 14.2] is 


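\[ \underset{T\times 1}{y_j} = \underset{T\times k_j}{X_j}\beta_j + \varepsilon_j \quad (j = 1, \ldots, m). \tag{5.9} \]

In this case

\[ Z_j = \Big[\underset{T\times\sum_{i<j}k_i}{0},\; X_j,\; \underset{T\times\sum_{i>j}k_i}{0}\Big], \]

with \(\beta = (\beta_1', \ldots, \beta_m')'\) and \(k = \sum_{i=1}^{m} k_i\).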

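To take another special case, for the Leontief cost function with two inputs

\[ \underset{2T\times 3}{Z} = \begin{bmatrix}
 y_1 & y_1(w_{21}/w_{11})^{1/2} & 0 \\
 \vdots & \vdots & \vdots \\
 y_T & y_T(w_{2T}/w_{1T})^{1/2} & 0 \\
 0 & y_1(w_{11}/w_{21})^{1/2} & y_1 \\
 \vdots & \vdots & \vdots \\
 0 & y_T(w_{1T}/w_{2T})^{1/2} & y_T
 \end{bmatrix}
 \quad\text{and}\quad
 \beta = \begin{pmatrix} b_{11} \\ b_{12} \\ b_{22} \end{pmatrix}. \]

In general the SUR model may be written

\[ y = Z\beta + \varepsilon, \tag{5.10} \]
\[ \varepsilon \mid (\beta, H, Z, A) \sim N(0, H^{-1}\otimes I_T). \tag{5.11} \]

The m x m matrix \(H = \Sigma^{-1}\) is the precision matrix of the disturbance vector
\((\varepsilon_{1t}, \ldots, \varepsilon_{mt})'\). The formulation (5.10) permits linear cross-equation constraints on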
the coefficients, whereas the more specific case (5.9) does not.

From (5.10)-(5.11) the pdf of the observables vector y is

\[ p(y \mid \beta, H, Z, A) = (2\pi)^{-Tm/2}|H|^{T/2}\exp[-(y - Z\beta)'(H\otimes I_T)(y - Z\beta)/2]. \tag{5.12} \]

Define the residual cross-product terms \(s_{ij} = (y_i - Z_i\beta)'(y_j - Z_j\beta)\) and m x m
matrix \(S = [s_{ij}]\), and observe

\[ (y - Z\beta)'(H\otimes I_T)(y - Z\beta) = \sum_{i=1}^{m}\sum_{j=1}^{m}(y_i - Z_i\beta)'(y_j - Z_j\beta)h_{ij} = \mathrm{tr}\,SH. \tag{5.13} \]

An alternative expression for (5.12) is therefore

\[ p(y \mid \beta, H, Z, A) = (2\pi)^{-Tm/2}|H|^{T/2}\exp(-\mathrm{tr}\,SH/2). \tag{5.14} \]
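
The identity behind (5.13), that the quadratic form in the stacked residuals with weight matrix H Kronecker I_T equals the trace of SH, is easy to confirm numerically. The Python sketch below builds the Kronecker product explicitly for a small case with randomly generated residuals and a randomly generated positive definite H. All inputs are synthetic, and the variable names are illustrative.

```python
import random

rng = random.Random(1)
T, m = 4, 3

# Residual matrix E (T x m): column j holds the residual vector of equation j.
E = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(T)]

# A positive definite precision matrix H = G G' + I.
G = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(m)]
H = [[sum(G[i][k] * G[j][k] for k in range(m)) + (1.0 if i == j else 0.0)
      for j in range(m)] for i in range(m)]

# S = E'E collects the residual cross products s_ij.
S = [[sum(E[t][i] * E[t][j] for t in range(T)) for j in range(m)] for i in range(m)]
trace_SH = sum(S[i][j] * H[j][i] for i in range(m) for j in range(m))

# Stack residuals equation by equation and form H (x) I_T explicitly.
e = [E[t][j] for j in range(m) for t in range(T)]
K = [[H[r // T][c // T] * (1.0 if r % T == c % T else 0.0)
      for c in range(m * T)] for r in range(m * T)]
quad = sum(e[r] * K[r][c] * e[c] for r in range(m * T) for c in range(m * T))

if __name__ == "__main__":
    print("quadratic form:", quad)
    print("tr(SH):        ", trace_SH)
```

In practice the trace form is the one to compute, since it needs only the m x m matrices S and H rather than the mT x mT Kronecker product.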


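Define \(\hat{\beta} = [Z'(H\otimes I_T)Z]^{-1}Z'(H\otimes I_T)y\) and observe \(Z'(H\otimes I_T)(y - Z\hat{\beta}) = 0\).
Hence (5.12) may also be expressed

\[ p(y \mid \beta, H, Z, A) = (2\pi)^{-Tm/2}|H|^{T/2}\exp[-(y - Z\hat{\beta})'(H\otimes I_T)(y - Z\hat{\beta})/2] \]
\[ \qquad\cdot\exp[-(\beta - \hat{\beta})'Z'(H\otimes I_T)Z(\beta - \hat{\beta})/2]. \tag{5.15} \]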

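If \(y^o\) replaces y in (5.14) or (5.15), then these expressions, interpreted as functions
of \(\beta\) and H, provide alternative representations of the likelihood function.
From (5.15) it then follows that the conditionally conjugate prior distribution for
\(\beta\) is \(\beta \sim N(\underline{\beta}, \underline{H}_\beta^{-1})\):

\[ p(\beta \mid A) = (2\pi)^{-k/2}|\underline{H}_\beta|^{1/2}\exp[-(\beta - \underline{\beta})'\underline{H}_\beta(\beta - \underline{\beta})/2]. \tag{5.16} \]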

The conditionally conjugate prior density function for H has the functional form
(5.14). The kernel is that of the Wishart distribution [see Johnson and Kotz (1972),
Chapter 38 or Anderson (1984), Section 7.2] for m x m random positive definite
matrices A:

\[ p(A \mid \Sigma, \nu) = \Big[2^{\nu m/2}\pi^{m(m-1)/4}\prod_{i=1}^{m}\Gamma\Big(\frac{\nu + 1 - i}{2}\Big)\Big]^{-1}
 |\Sigma|^{-\nu/2}|A|^{(\nu - m - 1)/2}\exp(-\mathrm{tr}\,\Sigma^{-1}A/2). \tag{5.17} \]

The corresponding specific distribution is the Wishart distribution with positive
definite matrix parameter \(\Sigma\) and degrees of freedom parameter \(\nu > m\), usually
denoted \(A \sim W(\Sigma, \nu)\). It is the distribution of the random matrix \(A = \sum_{i=1}^{\nu} x_i x_i'\)
if \(\nu\) is an integer and the \(x_i\) are i.i.d. \(N(0, \Sigma)\), and thus the marginal distribution of \(a_{ii}\)
is \(a_{ii}/\sigma_{ii} \sim \chi^2(\nu)\). This genesis of the Wishart distribution provides a simulation
method if \(\nu\) is an integer. The following algorithm, due to Anderson (1984),
Section 7.2, does not require \(\nu\) to be an integer and is more efficient unless \(\nu\) is a
small integer: 
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1. Compute the lower triangular Choleski factorization P of £, © = PP’. 

2. Simulate a lower triangular m x m matrix B with mutually independent ele- 
ments: b? ~ x2 —it+1)G@=1,...,m), and bj ~ N(O,1)@ > j). 

3. Let C = BB’; then C ~ WI, v). 

4. Set A= PCP’. 
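The four steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the BACC implementation: the function name `draw_wishart`, the example matrix `sigma`, and the value of `nu` are hypothetical. The check at the end uses the fact that the mean of $W(\Sigma, \nu)$ is $\nu\Sigma$.

```python
import numpy as np

def draw_wishart(sigma, nu, rng):
    """Draw A ~ W(sigma, nu) by the four steps above; nu need not be an integer."""
    m = sigma.shape[0]
    P = np.linalg.cholesky(sigma)                 # step 1: sigma = P P'
    B = np.zeros((m, m))
    rows, cols = np.tril_indices(m, k=-1)
    B[rows, cols] = rng.standard_normal(rows.size)    # step 2: b_ij ~ N(0, 1), i > j
    dof = nu - np.arange(m)                       # nu - i + 1 for i = 1, ..., m
    B[np.diag_indices(m)] = np.sqrt(rng.chisquare(dof))   # step 2: b_ii^2 ~ chi-square
    C = B @ B.T                                   # step 3: C ~ W(I_m, nu)
    return P @ C @ P.T                            # step 4

rng = np.random.default_rng(0)
sigma = np.array([[2.0, 0.6], [0.6, 1.0]])        # example scale matrix (assumed)
nu = 7.5                                          # a non-integer degrees of freedom
draws = np.array([draw_wishart(sigma, nu, rng) for _ in range(20000)])
mean_A = draws.mean(axis=0)                       # should be close to nu * sigma
print(mean_A)
print(nu * sigma)
```

The cost per draw is one Choleski factorization plus $m(m+1)/2$ scalar random variates, regardless of the size of $\nu$.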


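The conditionally conjugate prior distribution can be expressed $H \mid A \sim W(\underline{S}^{-1}, \underline{\nu})$,
and then

\begin{align}
p(H \mid A) ={}& \Big[ 2^{\underline{\nu} m/2} \pi^{m(m-1)/4} \prod_{i=1}^{m} \Gamma\Big(\frac{\underline{\nu} + 1 - i}{2}\Big) \Big]^{-1} |\underline{S}|^{\underline{\nu}/2} \tag{5.18} \\
& \cdot |H|^{(\underline{\nu} - m - 1)/2} \exp(-\operatorname{tr} \underline{S} H/2). \tag{5.19}
\end{align}

The “notional data” interpretation of this prior distribution is the information
about precision from $\underline{\nu}$ i.i.d. $m$-variate normal observations with sums of squares
and cross-products matrix $\underline{S}$. The prior mean of $H$ is $\underline{\nu}\,\underline{S}^{-1}$, and
$(\underline{s}^{jj})^{-1} h_{jj} \mid A \sim \chi^2(\underline{\nu})$ $(j = 1, \ldots, m)$, where $\underline{s}^{jj}$ denotes the $j$th diagonal
element of $\underline{S}^{-1}$. This prior distribution is therefore a generalization of the
prior distribution of $h$ in the univariate normal linear model (2.12).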

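From (5.15) and (5.16) the conditional posterior density kernel for $\beta$ is

$$p(\beta \mid H, y^o, Z, A) \propto \exp\{-[(\beta - \hat\beta)' Z'(H \otimes I_T) Z (\beta - \hat\beta) + (\beta - \underline{\beta})' \underline{H}_\beta (\beta - \underline{\beta})]/2\}.$$

Hence the conditional posterior distribution is $\beta \mid (H, y^o, Z, A) \sim N(\bar\beta, \bar{H}_\beta^{-1})$, with

$$\bar{H}_\beta = \underline{H}_\beta + Z'(H \otimes I_T) Z$$

and

$$\bar\beta = \bar{H}_\beta^{-1} [\underline{H}_\beta \underline{\beta} + Z'(H \otimes I_T) y^o] = \bar{H}_\beta^{-1} [\underline{H}_\beta \underline{\beta} + Z'(H \otimes I_T) Z \hat\beta].$$

From (5.14) and (5.19) the conditional posterior density kernel for $H$ is

$$p(H \mid \beta, y^o, Z, A) \propto |H|^{(\underline{\nu} + T - m - 1)/2} \exp\{-\operatorname{tr}[(\underline{S} + S)H]/2\},$$

whence

$$H \mid (\beta, y^o, Z, A) \sim W[(\underline{S} + S)^{-1}, \underline{\nu} + T].$$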
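Alternating draws from the two conditional posterior distributions just derived gives a two-block posterior simulator. The sketch below is an illustration, not the BACC code: the function names, the simulated two-equation data set, the prior hyperparameters, and the burn-in length are all assumptions made for the example.

```python
import numpy as np

def draw_wishart(scale, nu, rng):
    """A ~ W(scale, nu) via a lower triangular factor with chi-square diagonal."""
    m = scale.shape[0]
    B = np.tril(rng.standard_normal((m, m)), k=-1)
    B[np.diag_indices(m)] = np.sqrt(rng.chisquare(nu - np.arange(m)))
    PB = np.linalg.cholesky(scale) @ B
    return PB @ PB.T

def sur_gibbs(y, Z, m, b0, Hb0, S0, nu0, n_iter, rng):
    """Alternate beta | H and H | beta; y stacks y_1..y_m and Z stacks Z_1..Z_m."""
    T, k = y.size // m, Z.shape[1]
    ys, Zs = y.reshape(m, T), Z.reshape(m, T, k)
    H = np.eye(m)                                    # arbitrary starting value
    betas, Hs = np.empty((n_iter, k)), np.empty((n_iter, m, m))
    for it in range(n_iter):
        # Z'(H kron I_T)Z and Z'(H kron I_T)y as sums over equation pairs (i, j)
        ZHZ = np.einsum('ij,itk,jtl->kl', H, Zs, Zs)
        ZHy = np.einsum('ij,itk,jt->k', H, Zs, ys)
        Hbar = Hb0 + ZHZ                             # posterior precision of beta
        bbar = np.linalg.solve(Hbar, Hb0 @ b0 + ZHy) # posterior mean of beta
        L = np.linalg.cholesky(Hbar)                 # Hbar = L L'
        beta = bbar + np.linalg.solve(L.T, rng.standard_normal(k))
        E = ys - Zs @ beta                           # row i: y_i - Z_i beta
        S = E @ E.T
        H = draw_wishart(np.linalg.inv(S0 + S), nu0 + T, rng)
        betas[it], Hs[it] = beta, H
    return betas, Hs

# Simulated example: two equations, each with an intercept and one covariate (assumed).
rng = np.random.default_rng(42)
m, T = 2, 200
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
X1 = np.column_stack([np.ones(T), rng.standard_normal(T)])
X2 = np.column_stack([np.ones(T), rng.standard_normal(T)])
Z = np.block([[X1, np.zeros((T, 2))], [np.zeros((T, 2)), X2]])
eps = rng.multivariate_normal(np.zeros(m), Sigma, size=T)      # T x m disturbances
y = Z @ true_beta + eps.T.reshape(-1)

betas, Hs = sur_gibbs(y, Z, m, np.zeros(4), 0.01 * np.eye(4), np.eye(m), 3.0, 1500, rng)
post_beta = betas[500:].mean(axis=0)       # discard the first 500 draws as burn-in
post_H = Hs[500:].mean(axis=0)
print(post_beta)
print(post_H)
```

The sums over equation pairs avoid building $H \otimes I_T$ explicitly, which would be an $mT \times mT$ matrix.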
These conditional posterior distributions provide the Gibbs sampling algorithm,
first proposed by Percy (1992). BACC incorporates the seemingly unrelated regressions
model using the conjugate prior distribution and posterior simulation algorithm
described in this section.

Exercise 5.2.1 Missing Data Suppose that

$$y_t = (y_{t1}, y_{t2})' \overset{iid}{\sim} N(\mu, \Sigma) \quad (t = 1, \ldots, T).$$

Some of the observations $y_{ti}$ are missing at random. (Recall the definition in
Example 2.2.3.)

(a) Find $p(y_{t1} \mid y_{t2}, \mu, \Sigma, A)$ and $p(y_{t2} \mid y_{t1}, \mu, \Sigma, A)$. (Hint: You may find
Theorem 5.3.1 useful.)

(b) Develop a Gibbs sampling posterior simulator for this model, assuming 
conditionally conjugate prior distributions. 

(c) In the context of the model in (b), show that if both $y_{t1}$ and $y_{t2}$ are missing
for given $t$, then nothing changes if this observation is simply excluded from
the sample. Is the algorithm in (b) more efficient if observation $t$ is included,
or if it is excluded?


Exercise 5.2.2 Completing the Argument Derive (5.14) and (5.15) from (5.12). 


Exercise 5.2.3 The Wishart and Gamma Distributions Show that the Wishart
distribution for $1 \times 1$ matrices is the gamma distribution. Specifically, if
$h \sim W(1/s^2, \nu)$, then $s^2 h \sim \chi^2(\nu)$.


Exercise 5.2.4 An Auction Application At a second price auction, bids are writ- 
ten and sealed. The object is sold to the highest bidder. The sale price is that bid 
by the second-highest bidder. Suppose you have data on T auctions, each of which 
has the same n bidders. The number n is small (say, 4 or 5) while T is large (say, 
100-200). For each auction, you know the identity of the winning bidder and the 
price he offered. You also know the identity of the second-highest bidder, and the 
price she offered (which in turn is that paid by the winner). You do not know the 
bids of the other n — 2 bidders. 
Suppose that your model is

$$y_{it} = \beta' x_{it} + \varepsilon_{it} \quad (i = 1, \ldots, n;\ t = 1, \ldots, T);$$

$$\varepsilon_t = (\varepsilon_{1t}, \ldots, \varepsilon_{nt})'; \quad \varepsilon_t \mid (H, A) \overset{iid}{\sim} N(0, H^{-1}) \quad (t = 1, \ldots, T).$$

The $k \times 1$ vector $x_{it}$ quantifies characteristics of bidder $i$ and the object offered
at auction $t$. The $n \times 1$ vector $y_t = (y_{1t}, \ldots, y_{nt})'$ contains the valuations of the $n$
bidders for the object offered at auction $t$. A standard result in elementary auction
theory [see, e.g., Fudenberg and Tirole (1995), Example 1.2] is that each bidder
will bid his or her valuation of the object being sold.

You observe all of the $x_{it}$ $(i = 1, \ldots, n;\ t = 1, \ldots, T)$. But you observe only
the two largest elements of $(y_{1t}, \ldots, y_{nt})$ as described above. The prior distribution
has two independent components: $\beta \mid A \sim N(\underline{\beta}, \underline{H}_\beta^{-1})$ and $H \mid A \sim W(\underline{S}^{-1}, \underline{\nu})$.
Construct a posterior simulator that provides random $\beta^{(m)}$ and $H^{(m)}$ whose invariant
distribution is the posterior distribution. [For other auction applications using
similar tools, see Sareen (1999, 2003), Bajari and Lee (2003), and Albano and
Jouneau-Sion (2004).]


Exercise 5.2.5 Panel Data and Random Coefficients A random coefficients
model for panel data applies to each of $n$ households $(i = 1, \ldots, n)$ observed in
each of $T$ time periods $(t = 1, \ldots, T)$. For household $i$ in time period $t$, we have

$$y_{it} = \beta_i' x_{it} + \varepsilon_{it}$$

in which $x_{it}$ is a vector of covariates that may be regarded as fixed or ancillary;
$\varepsilon_{it} \mid (h, A) \overset{iid}{\sim} N(0, h^{-1})$; and $\beta_i \mid (\beta, H, A) \overset{iid}{\sim} N(\beta, H^{-1})$. The $nT$ random
unobservable disturbances and the $n$ unobservable random vectors $\beta_i$ are mutually
independent. The observables are $(x_{it}, y_{it})$ $(i = 1, \ldots, n;\ t = 1, \ldots, T)$.


(a) Assume that $\beta$, $H$, and $h$ are independent in the prior distribution. Write
down a normal prior density for $\beta$, a Wishart prior density for $H$, and a
gamma prior density for $h$, including the kernels of these densities.

(b) Express the probability density function for $\beta_1, \ldots, \beta_n$ conditional on $\beta$
and $H$.

(c) Express the probability density function for $y_{it}$ $(i = 1, \ldots, n;\ t = 1, \ldots, T)$
conditional on $\beta_i$ $(i = 1, \ldots, n)$, $x_{it}$ $(i = 1, \ldots, n;\ t = 1, \ldots, T)$, and $h$.

(d) Using the expressions from (a), (b) and (c), write down the posterior density
kernel for $\beta$, $H$, $h$, and $\beta_1, \ldots, \beta_n$.

(e) Formulate a Gibbs sampling algorithm for the posterior distribution. Clearly 
indicate each step. 


(f) Will the Gibbs sampling algorithm in (e) converge to the posterior distribu- 
tion? (Briefly justify your answer.) 


(g) Suppose that you have available $x_{1,T+1}$, the covariate vector for the first
household in the data set in the next period, but not $y_{1,T+1}$. Assuming that
the same model continues to apply to household 1 in period $T + 1$, show
how you could use the output of the Gibbs sampler from part (e) to obtain
draws from the predictive distribution for $y_{1,T+1}$.

(h) Suppose that you have available $x_{n+1,T+1}$, the covariate vector in the next
period, from a household not in the data set, but not $y_{n+1,T+1}$. Assuming
that the same model applies to household $n + 1$ in period $T + 1$, as to the
households and time periods in the data set, show how you could use the
output of the Gibbs sampler from part (e) to obtain draws from the predictive
distribution for $y_{n+1,T+1}$.


Exercise 5.2.6 Hierarchical Priors and Unbalanced Data in the SUR Model
Suppose that you are interested in a set of relationships

$$y_{it} = \beta_i' x_{it} + \varepsilon_{it}.$$

The subscript $i$ denotes countries, the subscript $t$ denotes time, and each covariate
vector $x_{it}$ is $k \times 1$. There are $n$ countries and $T$ time periods, and $n$ is small
relative to $T$. Here are some alternative specifications of the prior distribution, the
observables distribution, and the observables.


(a) The prior distribution: 

(i) The coefficient vectors $\beta_i$ are all the same.

(ii) The coefficient vectors are similar but not identical.

(b) The observables distribution:

(i) The disturbances $\varepsilon_{it}$ are i.i.d. (across both $i$ and $t$):
$\varepsilon_{it} \mid (h, A) \sim N(0, h^{-1})$.

(ii) The disturbance vectors $\varepsilon_t = (\varepsilon_{1t}, \ldots, \varepsilon_{nt})'$ are i.i.d.:
$\varepsilon_t \mid (H, A) \sim N(0, H^{-1})$ $(t = 1, \ldots, T)$.

(c) The observables:

(i) $\{x_{it}, y_{it}\}$ are observed for all $i = 1, \ldots, n$ and $t = 1, \ldots, T$.
(ii) {Xx} are observed for all i = 1,...,n and t = 1,..., T. However, for 
each country i, {yj} is observed for the time periods ih pana ae where 


Peer =F, 


The objective in this exercise is to carefully complete the model consisting of 
(a.ii) and (b.ii) with a proper prior distribution, derive the posterior distribution 
corresponding to this prior distribution and the observed data indicated in (c.ii), 
and construct a simulator for this posterior distribution that provides simulation- 
consistent approximations of posterior moments. 

If you can do this in one step, then do so. But you may find it easier to consider
simpler variants of the model first [say, (a.i) plus (b.ii) plus (c.i)], and then take
advantage of the “building block” character of many MCMC algorithms. (The
combination of conditions (a.i), (b.ii), and (c.i) is fully covered in this section.)


5.3 LINEAR CONSTRAINTS IN THE LINEAR MODEL 


In the normal linear regression model introduced in Example 2.1.2 it is common
to impose, or at least entertain, restrictions on the coefficient vector $\beta$ that go well
beyond the prior distributions for $\beta$ considered to this point. This is evident in
discussions of whether coefficient estimates “have the right sign” and in procedures
such as stepwise addition and deletion of covariates. If these restrictions are not
incorporated formally in the specification of the model, then it is impossible to provide
appropriate statements of uncertainty, taking either a Bayesian or non-Bayesian
approach. A case of particular concern is that in which the investigator begins
with a long list of potential covariates, and then following some data manipulation
(e.g., stepwise deletion of variables) presents least-squares coefficient estimates and
accompanying standard errors. The latter do not have a classical, sampling-theoretic
interpretation, since there is no accounting for outcomes in which other covariates
would have been selected. Nor do they have a Bayesian interpretation even in
the context of Example 3.2.1, because they correspond to a prior distribution that
excludes the deleted covariates with certainty.

Example 4.2.1 introduced the general problem, and discussed why acceptance 
sampling algorithms like those first proposed to handle the problem in Geweke 
(1986) are often impractical. Example 4.5.2 and Exercise 4.3.3 took up more spe- 
cific algorithms in particular cases. This section generalizes the latter approaches. 
While the generalization here includes many of the cases that arise, it is by no 
means exhaustive, and approaches similar to the one described here can often be 
applied in other instances. The analysis focuses on formulating the prior distribution
of $\beta$ to incorporate the restrictions, and on the posterior distribution of $\beta$
conditional on all other parameters in the model as well as the data. Consequently,
application of the analysis here extends well beyond the linear model, to models 
that are equivalent to the linear model given suitable conditioning. In particular, 
the methods in this section can be used in conjunction with the Gibbs sampling 
algorithms developed in Sections 5.2, 6.2, 6.1, 6.4, and 7.1. 


5.3.1 Linear Inequality Constraints 


As a motivating example, suppose that the observables are the $k - 1$ inputs
$x^*_{t2}, \ldots, x^*_{tk}$ and the output $y^*_t$ from each of $T$ firms. The model incorporates a
Cobb-Douglas production technology

$$\log y^*_t = \log \beta_{t1} + \sum_{j=2}^{k} \beta_j \log x^*_{tj}.$$

The unobservable $\beta_{t1}$ varies across firms due to different fixed factors for each firm.
For all $T$ firms denote $x_{t1} = 1$, $x_{tj} = \log x^*_{tj}$ $(j = 2, \ldots, k)$, and $y_t = \log(y^*_t)$. Take
$X = [x_{tj}]$ and let $y = (y_1, \ldots, y_T)'$. If model $A$ specifies $\log \beta_{t1} \mid (\beta_1, h, A)$ as
$N(\beta_1, h^{-1})$ and the independent prior distributions $\beta \mid A \sim N(\underline{\beta}, \underline{H}^{-1})$ and
$\underline{s}^2 h \mid A \sim \chi^2(\underline{\nu})$, it is then a special case of the normal linear model of Example 2.1.2.
However, the investigator also believes that $\beta_j > 0$ $(j = 2, \ldots, k)$.

This example is an instance of a more general set of restrictions

$$a < \beta < w \tag{5.20}$$

in the linear model, where it is understood that elements of $a$ may include $-\infty$, and
those of $w$ may include $+\infty$, as well as real numbers. In the motivating example
$a_1 = -\infty$, $a_j = 0$ $(j = 2, \ldots, k)$ and $w_j = \infty$ $(j = 1, \ldots, k)$. Thus (5.20) includes
the possibility of up to $k$ nonredundant linear inequality restrictions, but no more.
Inequalities are taken as strict in (5.20) without loss of generality, since both prior
and posterior distributions of $\beta$ are absolutely continuous. Inequality restrictions
of the form

$$a < D\beta < w, \tag{5.21}$$

with $D$ nonsingular, may be accommodated by appropriate reparametrization of $\beta$
and a corresponding transformation of the columns of $X$ (see Exercise 5.3.1).

The posterior density kernel is that in Example 2.1.2, truncated to the set
$\{\beta : a < \beta < w\}$. Consider a Gibbs sampling posterior simulation algorithm that is
fully blocked in $h, \beta_1, \ldots, \beta_k$. The conditional posterior distribution of $h$ remains
the same, given by (2.25)–(2.26). Conditional on $h$, $X$, and $y^o$, $\beta \sim N(\bar\beta, \bar{H}^{-1})$,
with $\bar{H}$ and $\bar\beta$ given by (2.19) and (2.20), subject to the restrictions (5.20). In
constructing the posterior distribution of $\beta_j$ conditional on $\beta_i$ $(i \neq j)$, $h$, $X$, and $y^o$,
the following result for the multivariate normal distribution is useful.


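Theorem 5.3.1 Conditional Multivariate Normal Distribution Let

$$z = \begin{pmatrix} x \\ y \end{pmatrix} \sim N(\mu, \Sigma) \quad \text{with} \quad \mu = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}.$$

Denote the corresponding precision matrix

$$H = \Sigma^{-1} = \begin{pmatrix} H_{xx} & H_{xy} \\ H_{yx} & H_{yy} \end{pmatrix}.$$

Then the distribution of $y$ conditional on $x$ is normal with variance

$$\Sigma_{y|x} = \Sigma_{yy} - \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy} = H_{yy}^{-1} \tag{5.22}$$

and mean

$$\mu_{y|x} = \mu_y + \Sigma_{yx} \Sigma_{xx}^{-1} (x - \mu_x) = \mu_y - H_{yy}^{-1} H_{yx} (x - \mu_x). \tag{5.23}$$

Proof: The first equalities in (5.22) and (5.23) are derived in many texts on
multivariate statistics; see Anderson (1984), Section 2.5.1 and Johnson and Kotz
(1972), Section 35.3, for examples. The second equalities are then a consequence
of a standard expression for the inverse of the partitioned matrix $\Sigma$ (Anderson
1984, Section A.3.2; Greene 2003, Section A.5.3):

$$\Sigma^{-1} = \begin{pmatrix} (\Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx})^{-1} & -\Sigma_{xx}^{-1} \Sigma_{xy} (\Sigma_{yy} - \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy})^{-1} \\ -(\Sigma_{yy} - \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy})^{-1} \Sigma_{yx} \Sigma_{xx}^{-1} & (\Sigma_{yy} - \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy})^{-1} \end{pmatrix}.$$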
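A quick numerical check of Theorem 5.3.1 may help fix ideas: the covariance-based and precision-based expressions in (5.22) and (5.23) should agree for any positive definite $\Sigma$. The sketch below uses an arbitrary randomly generated example; the dimensions and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, nx = 5, 2                                   # total dimension and dimension of x (assumed)
M = rng.standard_normal((n, n))
Sigma = M @ M.T + n * np.eye(n)                # arbitrary positive definite covariance
mu = rng.standard_normal(n)
H = np.linalg.inv(Sigma)                       # precision matrix

Sxx, Sxy = Sigma[:nx, :nx], Sigma[:nx, nx:]
Syx, Syy = Sigma[nx:, :nx], Sigma[nx:, nx:]
Hyx, Hyy = H[nx:, :nx], H[nx:, nx:]
x = rng.standard_normal(nx)                    # a value of x to condition on

# (5.22): conditional variance, from covariance blocks and from the precision block
var_from_cov = Syy - Syx @ np.linalg.solve(Sxx, Sxy)
var_from_prec = np.linalg.inv(Hyy)

# (5.23): conditional mean, from covariance blocks and from precision blocks
mean_from_cov = mu[nx:] + Syx @ np.linalg.solve(Sxx, x - mu[:nx])
mean_from_prec = mu[nx:] - np.linalg.solve(Hyy, Hyx @ (x - mu[:nx]))

print(np.max(np.abs(var_from_cov - var_from_prec)))
print(np.max(np.abs(mean_from_cov - mean_from_prec)))
```

The precision form is the convenient one when the joint distribution is already parametrized by a precision matrix, since the conditional moments of a single element then require no matrix inversion.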


Applying Theorem 5.3.1 to the conditional posterior distribution of $\beta_j$, we obtain

$$\beta_j \mid [\beta_i\,(i \neq j), h, y^o, X, A] \sim N\Big[ \bar\beta_j - \bar{h}_{jj}^{-1} \sum_{i \neq j} \bar{h}_{ji} (\beta_i - \bar\beta_i),\ \bar{h}_{jj}^{-1} \Big]$$

subject to $a_j < \beta_j < w_j$, where $\bar{h}_{ji}$ denotes element $(j, i)$ of $\bar{H}$. Draws from these
truncated univariate normal distributions may be made efficiently using the algorithm
described in Example 4.2.2. This provides the basis for a Gibbs sampling algorithm,
first proposed in Geweke (1996b).
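The element-by-element sampler can be sketched as follows. Several things here are assumptions rather than the text's method: the truncated normal draw uses a simple inverse-cdf step instead of the algorithm of Example 4.2.2 (which is not reproduced in this section, and which is preferable when the truncation region lies far in a tail); the expressions coded for $\bar{H}$, $\bar\beta$, and the conditional distribution of $h$ are the standard forms for the normal linear model with independent normal and scaled chi-square priors, taken to be what (2.19), (2.20), and (2.25)–(2.26) state; and the function names, data, and prior settings are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()

def draw_truncnorm(mean, prec, lo, hi, rng):
    """Inverse-cdf draw from N(mean, 1/prec) restricted to (lo, hi)."""
    sd = prec ** -0.5
    p_lo, p_hi = _STD.cdf((lo - mean) / sd), _STD.cdf((hi - mean) / sd)
    u = float(np.clip(rng.uniform(p_lo, p_hi), 1e-15, 1 - 1e-15))
    return float(np.clip(mean + sd * _STD.inv_cdf(u), lo, hi))

def constrained_gibbs(y, X, a, w, b0, H0, s0sq, nu0, beta_init, n_iter, rng):
    """Gibbs sampler blocked in h, beta_1, ..., beta_k under a < beta < w."""
    T, k = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta, h = np.array(beta_init, dtype=float), 1.0
    draws = np.empty((n_iter, k))
    for it in range(n_iter):
        Hbar = H0 + h * XtX                                # assumed form of (2.19)
        bbar = np.linalg.solve(Hbar, H0 @ b0 + h * Xty)    # assumed form of (2.20)
        for j in range(k):
            other = np.arange(k) != j
            shift = Hbar[j, other] @ (beta[other] - bbar[other]) / Hbar[j, j]
            beta[j] = draw_truncnorm(bbar[j] - shift, Hbar[j, j], a[j], w[j], rng)
        resid = y - X @ beta
        # assumed form of (2.25)-(2.26): (s0sq + resid'resid) * h ~ chi-square(nu0 + T)
        h = rng.chisquare(nu0 + T) / (s0sq + resid @ resid)
        draws[it] = beta
    return draws

# Simulated example in the spirit of the motivating example: slopes restricted to be positive.
rng = np.random.default_rng(7)
T = 100
X = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])
y = X @ np.array([1.0, 0.6, 0.05]) + 0.5 * rng.standard_normal(T)
a = np.array([-np.inf, 0.0, 0.0])
w = np.full(3, np.inf)
draws = constrained_gibbs(y, X, a, w, np.zeros(3), 0.01 * np.eye(3), 1.0, 3.0,
                          np.array([0.0, 0.5, 0.5]), 2000, rng)
post_mean = draws[500:].mean(axis=0)
print(post_mean)
```

The starting value `beta_init` must satisfy the constraints; every subsequent draw then does so by construction.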


5.3.2 Conjectured Linear Restrictions, Linear Inequality Constraints, 
and Covariate Selection 


In the motivating example of Section 5.3.1, suppose that the investigator thinks that
$\beta_{t1}$ may be related to some other observable characteristics of the firm. If the firm is
a farm, for instance, these might include a measure of land quality, the educational
attainment of the farm manager, and an indicator of whether the manager is also
the owner. If the investigator assumes a linear relationship between $\log \beta_{t1}$ and
observable characteristics $x_{t,k+1}, \ldots, x_{t,k+p}$, then $\log y^*_t = \sum_{j=1}^{k+p} \beta_j x_{tj} + \varepsilon_t$, where
$\varepsilon_t$ reflects the unobservable determinants of $\log \beta_{t1}$. The investigator is uncertain
whether each observable characteristic of the firm really influences $\log \beta_{t1}$, given all
the other characteristics and the inputs, but is willing to assume that the influence is
nonnegative. A prior distribution in which $\beta_{k+1}, \ldots, \beta_{k+p}$ are mutually independent
and independent of $\beta_1, \ldots, \beta_k$ might then take the form

$$P(\beta_j = 0 \mid A) = \underline{p}_j,$$

$$\beta_j \mid (\beta_j \neq 0, A) \sim N(\underline{\beta}_j, \underline{h}_j^{-1})$$

subject to $\beta_j > 0$ for $j = k + 1, \ldots, k + p$.

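To provide a general analytical treatment, assume the specification of the normal
linear model of Example 2.1.2, except that in the prior distribution the coefficients
$\beta_j$ are mutually independent: $P(\beta_j = 0 \mid A) = \underline{p}_j$ and if $\beta_j \neq 0$, then
$\beta_j \sim N(\underline{\beta}_j, \underline{h}_j^{-1})$ truncated to $a_j < \beta_j < w_j$. As in Section 5.3.1, $a_j = -\infty$, $w_j = \infty$,
or both, are possible. The properly normalized prior density of $\beta_j$ is then

$$p(\beta_j \mid A) = \underline{p}_j \delta_0(\beta_j) + (1 - \underline{p}_j) \lambda_j \exp[-\underline{h}_j (\beta_j - \underline{\beta}_j)^2/2]\, I_{(a_j, w_j)}(\beta_j),$$

where $\delta_0$ denotes the Dirac delta function defined in (4.18), and

$$\lambda_j = (2\pi)^{-1/2} \underline{h}_j^{1/2} \{\Phi[\underline{h}_j^{1/2}(w_j - \underline{\beta}_j)] - \Phi[\underline{h}_j^{1/2}(a_j - \underline{\beta}_j)]\}^{-1},$$

where $\Phi$ is the cdf of the standard normal distribution.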

Because it specifies that the elements of $\beta$ are independent a priori, this model
does not include the motivating example, nor does the model in Section 5.3.1 correspond
to the special case $\underline{p}_j = 0$ $(j = 1, \ldots, k)$. The modifications of the treatment
below are fairly straightforward in each case, but the notation becomes more cumbersome.
For treatments of more general cases, see George and McCulloch (1993,
1997), Geweke (1996a), Chipman et al. (1998), and Brown et al. (1999). Concentrating
the prior point mass of $\beta_j$ on $\beta_j = 0$ is innocuous. As formulated, this
specification includes the conventional selection of regressors problem, in which
the investigator is uncertain of which in a list of covariates really belongs in a
model, and may also wish to impose sign restrictions. This could be cast as a
model combination problem, but since there are $2^k$ possible models, this approach
becomes unwieldy for more than (say) a half-dozen candidate covariates.

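The posterior distribution can be sampled using a Gibbs sampler fully blocked
in $\beta_1, \ldots, \beta_k, h$. The conditional posterior distribution of $h$ is the same as that in
Example 2.1.2. For each $\beta_j$, we obtain

\begin{align}
p[\beta_j \mid \beta_i\,(i \neq j), h, y^o, X, A] \propto{}& \{\underline{p}_j \delta_0(\beta_j) + (1 - \underline{p}_j) \lambda_j \exp[-\underline{h}_j (\beta_j - \underline{\beta}_j)^2/2]\, I_{(a_j, w_j)}(\beta_j)\} \\
& \cdot \exp\Big[-h \sum_{t=1}^{T} (z_t - \beta_j x_{tj})^2/2\Big], \tag{5.24}
\end{align}

where $z_t = y_t - \sum_{i \neq j} \beta_i x_{ti}$. From (5.24), if $\beta_j \neq 0$, then $\beta_j \sim N(\bar\beta_j, \bar{h}_j^{-1})$ subject
to $\beta_j \in (a_j, w_j)$, where

$$\bar{h}_j = \underline{h}_j + h \sum_{t=1}^{T} x_{tj}^2 \quad \text{and} \quad \bar\beta_j = \bar{h}_j^{-1} \Big( \underline{h}_j \underline{\beta}_j + h \sum_{t=1}^{T} x_{tj} z_t \Big).$$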

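Furthermore

$$P[\beta_j = 0 \mid \beta_i\,(i \neq j), h, y^o, X, A] \propto \underline{p}_j \exp\Big(-h \sum_{t=1}^{T} z_t^2/2\Big)$$

and

$$P[\beta_j \neq 0 \mid \beta_i\,(i \neq j), h, y^o, X, A] \propto (1 - \underline{p}_j) \lambda_j \int_{a_j}^{w_j} \exp[-\underline{h}_j (\beta_j - \underline{\beta}_j)^2/2] \exp\Big[-h \sum_{t=1}^{T} (z_t - \beta_j x_{tj})^2/2\Big] d\beta_j. \tag{5.25}$$

Completing the square in the last line yields the closed-form expression

$$(2\pi)^{1/2} \bar{h}_j^{-1/2} \{\Phi[\bar{h}_j^{1/2}(w_j - \bar\beta_j)] - \Phi[\bar{h}_j^{1/2}(a_j - \bar\beta_j)]\} \exp\Big[-\Big(h \sum_{t=1}^{T} z_t^2 + \underline{h}_j \underline{\beta}_j^2 - \bar{h}_j \bar\beta_j^2\Big)/2\Big] \tag{5.26}$$

for the integral in (5.25). Thus

$$\bar{p}_j = P[\beta_j = 0 \mid \beta_i\,(i \neq j), h, y^o, X, A] = \frac{\underline{p}_j \exp\big(-h \sum_{t=1}^{T} z_t^2/2\big)}{\underline{p}_j \exp\big(-h \sum_{t=1}^{T} z_t^2/2\big) + (1 - \underline{p}_j) \kappa_j}, \tag{5.28}$$

where $\kappa_j$ denotes $\lambda_j$ times the expression (5.26).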
In each iteration of the Gibbs sampling algorithm some of the coefficients $\beta_j$
are zero and some are not. A simulation-consistent approximation of
$P[\beta_j = 0 \mid y^o, X, A]$ is $M^{-1} \sum_{m=1}^{M} \delta(\beta_j^{(m)}, 0)$. [The expression $\delta(\beta_j^{(m)}, 0)$ is an instance of the
Kronecker delta function, defined $\delta(a, b) = 1$ if $a = b$ and $\delta(a, b) = 0$ if $a \neq b$.]
An alternative approximation can be based on the evaluation $\bar{p}_j^{(m)}$ of (5.28) made
in each iteration. Since

$$\bar{p}_j^{(m)} = E[\delta(\beta_j^{(m)}, 0) \mid \beta_i^{(m)}\,(i < j), \beta_i^{(m-1)}\,(i > j), h^{(m)}, y^o, X, A],$$

$M^{-1} \sum_{m=1}^{M} \bar{p}_j^{(m)}$ is also a simulation-consistent approximation. The principle of
concentrated expectations (Theorem 4.4.1) suggests that the variance of this
approximation will be smaller. This generally turns out to be the case. It can
be especially advantageous when $M \cdot P(\beta_j = 0 \mid y^o, X, A) \ll 1$, in which case
$M^{-1} \sum_{m=1}^{M} \delta(\beta_j^{(m)}, 0) = 0$ is a likely outcome and assessment of the numerical standard
error of this approximation is impossible. The same is true when

$$M[1 - P(\beta_j = 0 \mid y^o, X, A)] \ll 1.$$


Exercise 5.3.1 Transformation for Inequality Constraints In the motivating
example for linear inequality constraints in Section 5.3.1, suppose that the additional
inequalities $\beta_j < 1$ $(j = 2, \ldots, k)$ and $0 < \sum_{j=2}^{k} \beta_j < 1$ (the latter corresponding
to diminishing returns to scale) were imposed.


(a) Show that these inequalities cannot be expressed in the form (5.21). 
(b) Adapt the methods of Section 5.3.2 to sample from the posterior distribution. 


Exercise 5.3.2 Improving Efficiency Suppose that in the inequality-constrained
linear model $\beta' = (\beta_1', \beta_2')$, $a_1 < \beta_1 < w_1$, where $\beta_1$ is $k_1 \times 1$, but $\beta_2$ is unconstrained.
There is a Gibbs sampling algorithm for the posterior distribution that
uses $k_1 + 2$ blocks and is generally more efficient than the one described in
Section 5.3.1. Describe the algorithm, indicating explicitly how the elements of
$\beta_1$ are drawn.


Exercise 5.3.3 Order-Restricted Inference Suppose that 


The prior belief about the precision matrix $H$ is $H \mid A \sim W(\underline{S}^{-1}, \underline{\nu})$.

(a) Suppose that the prior belief about $\mu$ is that $\mu_1 > \mu_2 > \cdots > \mu_p$ and that,
subject to this restriction, $\mu_j - \mu_{j+1} \overset{iid}{\sim} N(0, h^{-1})$. Adapt the Gibbs sampling
algorithm of Section 5.3.1 to this model.

(b) Suppose instead that the prior belief about $\mu$ is that the $\mu_j - \mu_{j+1}$ are independent
and identically distributed: $P(\mu_j - \mu_{j+1} = 0) = \underline{p}$; if $\mu_j - \mu_{j+1} \neq 0$,
then $\mu_j - \mu_{j+1} \sim N(0, h^{-1})$ subject to $\mu_j - \mu_{j+1} > 0$. Adapt the Gibbs
sampling algorithm of Section 5.3.2 to this model.


(c) Consider an alternative model in which $\mu_1 = \cdots = \mu_p$. Express a conditionally
conjugate prior distribution for this model. How would you find the
Bayes factor in favor of this model, relative to the one in part (b)?


(d) Now suppose the additional complication that some of the components $x_{tj}$
are missing at random (recall Example 2.2.3). How would you modify the
posterior simulation algorithms in parts (a) and (b) to accommodate this
complication?


Exercise 5.3.4 Inequality Constraints and Model Probability In the normal
linear regression model $y \sim N(X\beta, h^{-1} I_T)$ the precision parameter $h$ is known to
be $h = 1$, and $X'X = D$, a $k \times k$ diagonal matrix $D = \operatorname{diag}(d_1, \ldots, d_k)$. Moreover,
$b = (X'X)^{-1} X' y^o = 0$. Consider three variants of this model, distinguished by the
prior distribution of $\beta$.

• In model 1, $\beta \mid A_1 \sim N(0, \underline{H}^{-1})$. $\underline{H}$ is a diagonal matrix:
$\underline{H} = \operatorname{diag}(\underline{h}_1, \ldots, \underline{h}_k)$.

• In model 2, $\beta \mid A_2 = 0$. There is no uncertainty about $\beta$ in this model.

• In model 3, the coefficients $\beta_i$ are independently distributed. With probability
$\underline{p}_i$, $\beta_i = 0$. With probability $1 - \underline{p}_i$, $\beta_i$ has a half-normal distribution with
precision $\underline{h}_i$:

$$p(\beta_i \mid \beta_i \neq 0, A_3) = (2\underline{h}_i/\pi)^{1/2} \exp(-\underline{h}_i \beta_i^2/2)\, I_{(0, \infty)}(\beta_i).$$


(a) Find the marginal likelihood in models 1 and 2. [You may find (2.80) useful.] 

(b) Find the marginal likelihood in model 3. 

(c) Rank the models according to their marginal likelihood, and explain the 
ordering. 

(d) Find $P(\beta = 0 \mid y^o, X, A_3)$. Is the result surprising, given the ordering in
part (c)?


Exercise 5.3.5 Completing the Argument Derive (5.26) from (5.25). 


5.4 NONLINEAR REGRESSION 


The normal linear model provides a convenient but restrictive representation of the
distribution of $y$ conditional on $X$. This section weakens the assumption that the
regression function is linear in $x_t$, in favor of the specification that it is a smooth
function of $x_t$, but maintains the normality stipulation. Section 6.4 will weaken
the normality assumption. As with all subjective conditions, “smoothness” can be
characterized in different ways. This section takes up two different approaches.
Somewhat surprisingly, each leads to a posterior kernel identical to that of the normal
linear model of Example 2.1.2, but with a different matrix of covariates $X$
and with a different interpretation of the coefficient vector $\beta$. That nonlinear regression
is thus isomorphic to linear regression has two desirable consequences. On the
practical level, many Bayesian methods for the normal linear model can be applied
in normal nonlinear regression, including BACC as described in Section 5.1.
On the conceptual level, many of the rich elaborations of the normal linear model
that have been applied in Bayesian analysis can be applied directly in nonlinear
regression. These include the extension to nonnormality in Section 6.4, and the
weakening of the assumption that $y_t - E(y_t \mid x_t)$ is independently distributed,
discussed in Sections 7.1 and 7.3.


5.4.1 Nonlinear Regression with Smoothness Priors 


The essentials of nonlinear regression with smoothness priors are captured in the
simple model

$$y_t = f(x_t) + \varepsilon_t, \quad \varepsilon_t \overset{iid}{\sim} N(0, h^{-1}). \tag{5.29}$$

The function $f(\tau)$ is defined on a closed interval $\tau \in [\tau_1, \tau_2]$. In general, the
vector of interest $\omega$ will include elements $f(\tau_1^*), \ldots, f(\tau_q^*)$, where $\{\tau_1^*, \ldots, \tau_q^*\}$
is a collection of distinct points in $[\tau_1, \tau_2]$. The complete model $A$ must therefore
specify $p[f(\tau), \tau \in [\tau_1, \tau_2] \mid A]$. The model must also incorporate the idea that $f$
is a smooth function, in the sense that it is differentiable and $df(\tau)/d\tau$ changes
slowly with $\tau$.

A convenient and powerful analytic tool for expressing these beliefs is the
Wiener process $W(\tau)$, defined on $\tau \in [0, \infty)$ with $W(0) = 0$. A standard representation
is $W(\tau) = \int_0^{\tau} dW(u)$, where it is understood that the orthogonal increments
$dW(u)$ are normally distributed. [For further discussion, see Doob (1953),
Section 2.9 or Hamilton (1994), Section 17.2.] One important property of a Wiener
process is $W(\tau + s) - W(\tau) \sim N(0, s)$, for all $\tau > 0$ and all $s > 0$; this limits the
rapidity with which $W$ can move as a function of $\tau$, and this feature can in turn be
controlled by appropriate scaling of $W$. Another important property is that any pair
of increments $W(\tau + s) - W(\tau)$ and $W(\tau' + s') - W(\tau')$ has a bivariate normal
distribution. Each increment has mean zero. If $[\tau, \tau + s]$ and $[\tau', \tau' + s']$ do not
overlap, then the increments are uncorrelated; if the intervals do overlap, then their
covariance is the length of the overlap. In general

$$\operatorname{cov}[W(\tau + s) - W(\tau),\ W(\tau' + s') - W(\tau')] = \int_0^{\infty} I_{[\tau, \tau + s]}(u)\, I_{[\tau', \tau' + s']}(u)\, du.$$

More heuristically, movements of $W(\tau)$ over disjoint regions are uncorrelated.

A Wiener process has the properties ascribed to the function $f'(\tau) = df(\tau)/d\tau$,
and thus we pursue the idea that $f(\tau) \mid A$ is the integral of such a process. The
approach is that of Shiller (1984). Thus, $f(\tau) \mid A = h^{-1/2} \int_0^{\tau} W(u)\, du$, where the
hyperparameter $h$ controls smoothness. For any two points $\tau$ and $s$

$$f(\tau) \mid A = h^{-1/2} \int_0^{\tau} W(u)\, du = h^{-1/2} \int_0^{\tau} \int_0^{u} dW(r)\, du, \tag{5.30}$$

$$f(s) \mid A = h^{-1/2} \int_0^{s} W(v)\, dv = h^{-1/2} \int_0^{s} \int_0^{v} dW(r)\, dv \tag{5.31}$$

have a joint normal distribution, with $E[f(\tau) \mid A] = E[f(s) \mid A] = 0$. If $s > \tau$,
then, from (5.30) and (5.31), we obtain

\begin{align}
E[f(\tau) f(s) \mid A] &= h^{-1} \int_0^{\tau} \int_0^{s} \min(u, v)\, dv\, du \\
&= h^{-1} \int_0^{\tau} \Big[ \int_0^{u} v\, dv + \int_u^{s} u\, dv \Big] du \\
&= h^{-1} \int_0^{\tau} \Big[ \frac{u^2}{2} + u(s - u) \Big] du = \frac{h^{-1} \tau^2}{6} (3s - \tau). \tag{5.32}
\end{align}
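The closed form (5.32) can be confirmed by evaluating the double integral numerically, using the standard fact that $E[W(u)W(v)] = \min(u, v)$ for a Wiener process. The sketch below uses a midpoint rule on a rectangular grid; the function names and the particular values of $\tau$, $s$, and $h$ are illustrative assumptions.

```python
import numpy as np

def cov_f_closed(tau, s, h):
    """Right-hand side of (5.32); the two points may be given in either order."""
    lo, hi = min(tau, s), max(tau, s)
    return lo ** 2 * (3 * hi - lo) / (6 * h)

def cov_f_numeric(tau, s, h, n=1000):
    """Midpoint-rule value of (1/h) * double integral of min(u, v) over [0, tau] x [0, s]."""
    u = (np.arange(n) + 0.5) * tau / n
    v = (np.arange(n) + 0.5) * s / n
    return np.minimum.outer(u, v).sum() * (tau / n) * (s / n) / h

tau, s, h = 1.0, 2.5, 2.0                 # example values (assumed)
closed = cov_f_closed(tau, s, h)
numeric = cov_f_numeric(tau, s, h)
print(closed, numeric)
```

Setting $s = \tau$ in `cov_f_closed` returns $\tau^3/(3h)$, the variance of $f(\tau)$.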
0 


Since $\operatorname{var}[f(\tau)] = h^{-1} \tau^3/3$, the prior variance ascribed to $f(\tau)$ at a point
$\tau = s_1$ will depend strongly on the idea that $\tau = 0$ is a special point at which it is
known a priori that $f'(0) = 0$. This is an artificial assumption. It arises not from
prior ideas about smoothness (the reason for introducing the Wiener process as a
model for the prior) but rather from the analytical necessity of an initial condition
for $f'(\tau)$. There is a similar problem with the slope of the function $f(\tau)$ between
two points $s_1$ and $s_2$ $(s_2 > s_1)$, $[f(s_2) - f(s_1)]/(s_2 - s_1)$. From (5.32), we have

\begin{align}
\operatorname{var}\{[f(s_2) - f(s_1)]/(s_2 - s_1)\} &= \frac{1}{6h} \begin{pmatrix} -(s_2 - s_1)^{-1} \\ (s_2 - s_1)^{-1} \end{pmatrix}' \begin{pmatrix} 2 s_1^3 & s_1^2 (3 s_2 - s_1) \\ s_1^2 (3 s_2 - s_1) & 2 s_2^3 \end{pmatrix} \begin{pmatrix} -(s_2 - s_1)^{-1} \\ (s_2 - s_1)^{-1} \end{pmatrix} \\
&= [3 s_1 + (s_2 - s_1)]/3h, \tag{5.33}
\end{align}

which depends not only on the length of the interval $s_2 - s_1$ but also on the size
of $s_1$. However, for any three points $s_1 < s_2 < s_3$, we find that for the change in
the slope of $f(\tau)$

$$\operatorname{var}\Big[ \frac{f(s_3) - f(s_2)}{s_3 - s_2} - \frac{f(s_2) - f(s_1)}{s_2 - s_1} \Big] = \frac{s_3 - s_1}{3h}, \tag{5.34}$$

which can be derived from (5.32) in the same way that (5.33) was derived. Thus
the distribution of any change in slopes does not depend on the artifice of an initial
condition for $f'(\tau)$. This fact is not surprising, given that $f'(\tau) \mid A$ is a Wiener
process, and changes in the level of a Wiener process over an interval depend only
on the length of the interval and not on the distance of the interval from the origin
$\tau = 0$. Given $s_4 > s_3$, we have

$$\operatorname{cov}\Big[ \frac{f(s_3) - f(s_2)}{s_3 - s_2} - \frac{f(s_2) - f(s_1)}{s_2 - s_1},\ \frac{f(s_4) - f(s_3)}{s_4 - s_3} - \frac{f(s_3) - f(s_2)}{s_3 - s_2} \Big] = \frac{s_3 - s_2}{6h}. \tag{5.35}$$

If $s_6 > s_5 > s_4 > s_3 > s_2 > s_1$, then

$$\operatorname{cov}\Big[ \frac{f(s_3) - f(s_2)}{s_3 - s_2} - \frac{f(s_2) - f(s_1)}{s_2 - s_1},\ \frac{f(s_6) - f(s_5)}{s_6 - s_5} - \frac{f(s_5) - f(s_4)}{s_5 - s_4} \Big] = 0. \tag{5.36}$$
Without loss of generality, suppose that there are $m$ distinct values of $x_1, \ldots, x_T$.
Denote the ordered distinct values by $s_i$ $(i = 1, \ldots, m)$, and define $s = (s_1, \ldots, s_m)'$
and $\beta = [f(s_1), \ldots, f(s_m)]'$. Then the vector of unobservables in (5.29) is
$\theta' = (\beta', h)$, and (5.29) may be written $y = X\beta + \varepsilon$ with $x_{ti} = 1$ if $x_t = s_i$ and $x_{ti} = 0$
otherwise. The information in the smoothness prior of the form (5.34)–(5.36) may
be expressed

$$R\beta \sim N(0, G). \tag{5.37}$$

The matrix $R$ is $(m - 2) \times m$, with

$$r_{ii} = (s_{i+1} - s_i)^{-1}, \quad r_{i,i+1} = -[(s_{i+1} - s_i)^{-1} + (s_{i+2} - s_{i+1})^{-1}], \quad r_{i,i+2} = (s_{i+2} - s_{i+1})^{-1} \quad (i = 1, \ldots, m - 2)$$

and all other elements 0. The matrix $G$ is $(m - 2) \times (m - 2)$ with

$$g_{ii} = (s_{i+2} - s_i)/3h \quad (i = 1, \ldots, m - 2), \qquad g_{i,i+1} = g_{i+1,i} = (s_{i+2} - s_{i+1})/6h \quad (i = 1, \ldots, m - 3)$$

and all other elements 0.
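The matrices $R$ and $G$ are simple to construct, and the construction can be checked against the underlying process: if $K$ is the covariance matrix of $[f(s_1), \ldots, f(s_m)]'$ implied by (5.32), then $RKR'$ should reproduce $G$ exactly, including its zeros. The sketch below does this for an arbitrary grid; the function names, grid, and value of $h$ are assumptions for illustration.

```python
import numpy as np

def smoothness_prior(s, h):
    """Return R ((m-2) x m) and G ((m-2) x (m-2)) of (5.37) for sorted distinct points s."""
    s = np.asarray(s, dtype=float)
    m = s.size
    d = np.diff(s)                                  # d[i] = s[i+1] - s[i]
    R = np.zeros((m - 2, m))
    G = np.zeros((m - 2, m - 2))
    for i in range(m - 2):
        R[i, i] = 1.0 / d[i]
        R[i, i + 1] = -(1.0 / d[i] + 1.0 / d[i + 1])
        R[i, i + 2] = 1.0 / d[i + 1]
        G[i, i] = (s[i + 2] - s[i]) / (3.0 * h)
        if i < m - 3:                               # off-diagonal exists for all but the last row
            G[i, i + 1] = G[i + 1, i] = (s[i + 2] - s[i + 1]) / (6.0 * h)
    return R, G

def process_covariance(s, h):
    """Covariance matrix of f(s_1), ..., f(s_m) from the closed form (5.32)."""
    lo = np.minimum.outer(s, s)
    hi = np.maximum.outer(s, s)
    return lo ** 2 * (3.0 * hi - lo) / (6.0 * h)

s = np.array([0.5, 1.0, 1.7, 2.1, 3.0, 3.4])        # example grid (assumed)
h = 2.0
R, G = smoothness_prior(s, h)
G_from_process = R @ process_covariance(s, h) @ R.T
print(np.max(np.abs(G - G_from_process)))
print(np.linalg.matrix_rank(R.T @ np.linalg.inv(G) @ R))   # m - 2
```

The rows of $R$ annihilate constants and linear functions of $s$, which is why the implied precision $R'G^{-1}R$ has rank $m - 2$ and carries no information about level or slope.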

The derivation of this smoothness prior from a continuous process has substantial
practical advantages in enforcing consistency when the prior is updated with
additional prior information or with data, and in expressing posterior distributions
for $f(\tau)$ at points $\tau$ that do not correspond to any $s_i$. Suppose that we were to add
a point $s^*$ between $s_i$ and $s_{i+1}$ in the list $s_1, \ldots, s_m$, perhaps because $f(s^*)$ is an
element of the vector of interest $\omega$ but $s^* \neq s_i$ for any $i = 1, \ldots, m$. Incorporating
this point directly in the smoothness prior, $f(s_i)$ and $f(s_{i+1})$ are removed from $\beta$
and replaced with $f(s_i)$, $f(s^*)$ and $f(s_{i+1})$. Then the $(i - 2)$th and $(i - 1)$th linear
combinations in $R\beta$ are removed from (5.37) and replaced with

$$(s_i - s_{i-1})^{-1} f(s_{i-1}) - [(s_i - s_{i-1})^{-1} + (s^* - s_i)^{-1}] f(s_i) + (s^* - s_i)^{-1} f(s^*) = \zeta_i^*, \tag{5.38}$$

$$(s^* - s_i)^{-1} f(s_i) - [(s^* - s_i)^{-1} + (s_{i+1} - s^*)^{-1}] f(s^*) + (s_{i+1} - s^*)^{-1} f(s_{i+1}) = \zeta^*, \tag{5.39}$$

and

$$(s_{i+1} - s^*)^{-1} f(s^*) - [(s_{i+1} - s^*)^{-1} + (s_{i+2} - s_{i+1})^{-1}] f(s_{i+1}) + (s_{i+2} - s_{i+1})^{-1} f(s_{i+2}) = \zeta_{i+1}^*, \tag{5.40}$$

where $(\zeta_i^*, \zeta^*, \zeta_{i+1}^*)'$ is normal with mean 0 and

$$\operatorname{var} \begin{pmatrix} \zeta_i^* \\ \zeta^* \\ \zeta_{i+1}^* \end{pmatrix} = \frac{1}{6h} \begin{pmatrix} 2(s^* - s_{i-1}) & s^* - s_i & 0 \\ s^* - s_i & 2(s_{i+1} - s_i) & s_{i+1} - s^* \\ 0 & s_{i+1} - s^* & 2(s_{i+2} - s^*) \end{pmatrix}. \tag{5.41}$$

In the new prior, the marginal distribution of $\beta$ is the same as in the original prior.
This can be seen by multiplying (5.39) by

$$[(s^* - s_i)^{-1} + (s_{i+1} - s^*)^{-1}]^{-1} (s^* - s_i)^{-1}$$

and adding it to (5.38), and then multiplying (5.39) by

$$[(s^* - s_i)^{-1} + (s_{i+1} - s^*)^{-1}]^{-1} (s_{i+1} - s^*)^{-1}$$

and adding it to (5.40). The resulting two equations, after some algebra, are

$$(s_i - s_{i-1})^{-1} f(s_{i-1}) - [(s_i - s_{i-1})^{-1} + (s_{i+1} - s_i)^{-1}] f(s_i) + (s_{i+1} - s_i)^{-1} f(s_{i+1}) = \zeta_i,$$

$$(s_{i+1} - s_i)^{-1} f(s_i) - [(s_{i+1} - s_i)^{-1} + (s_{i+2} - s_{i+1})^{-1}] f(s_{i+1}) + (s_{i+2} - s_{i+1})^{-1} f(s_{i+2}) = \zeta_{i+1},$$

where

$$\zeta_i = \zeta_i^* + [(s^* - s_i)^{-1} + (s_{i+1} - s^*)^{-1}]^{-1} (s^* - s_i)^{-1} \zeta^*,$$

$$\zeta_{i+1} = \zeta_{i+1}^* + [(s^* - s_i)^{-1} + (s_{i+1} - s^*)^{-1}]^{-1} (s_{i+1} - s^*)^{-1} \zeta^*,$$

and from (5.41)

$$\operatorname{var} \begin{pmatrix} \zeta_i \\ \zeta_{i+1} \end{pmatrix} = \frac{1}{6h} \begin{pmatrix} 2(s_{i+1} - s_{i-1}) & s_{i+1} - s_i \\ s_{i+1} - s_i & 2(s_{i+2} - s_i) \end{pmatrix}.$$

These are precisely the $(i - 2)$th and $(i - 1)$th linear combinations of $R\beta$ that were
removed in the first place.

This smoothness prior, taken alone, does not constitute a proper prior distribution
for $\beta$. Its contribution to the prior precision of $\beta$ is $R'G^{-1}R$, an $m \times m$ matrix of
rank $m - 2$. The balance of a proper normal prior distribution for $\beta$ must provide
prior information about the level and slope of the function: the information from
the Wiener process that we discarded at the outset because it was artificial. If
the additional information is normal and independent of the information in the
smoothness prior, it can be represented

$$P\beta \sim N(p, F). \tag{5.42}$$

[If the information involves $f(s^*)$, and $s^*$ is not an element of $s$, then $s$ must be
redefined so as to incorporate the new point $s^*$, and the smoothness prior expressed
in the form (5.37). As demonstrated in (5.38) through (5.41), this has no impact on the
smoothness prior itself; the effect is simply to incorporate the fact that the smoothness
prior applies to the new point $s^*$.] Taken together, this information provides a
proper prior distribution if $\operatorname{rank}[R' \;\; P'] = m^*$, where $m^*$ is the number of points
involved in the prior distribution. Then the full prior distribution is $\beta \sim N(\bar\beta, H^{-1})$,
with $H = R'G^{-1}R + P'F^{-1}P$ and $\bar\beta = H^{-1}P'F^{-1}p$. For a variant on this idea, see
Example 5.4.1.
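
The combination of the smoothness prior with the additional information (5.42) can be sketched numerically. The following is only an illustration under stated assumptions, not the book's code: it uses NumPy, a toy grid of m = 5 equally spaced points, a second-difference matrix for R, an identity matrix standing in for G, and a matrix P that picks out the level at the first point and the first increment. The function name `full_prior` and all numerical values are hypothetical.

```python
import numpy as np

def full_prior(R, G, P, F, p):
    """Return (beta_bar, H) with H = R'G^{-1}R + P'F^{-1}P and beta_bar = H^{-1}P'F^{-1}p."""
    m = R.shape[1]
    # Properness check: the rows of R and P together must span all m points.
    if np.linalg.matrix_rank(np.vstack([R, P])) < m:
        raise ValueError("improper prior: rank of [R' P'] is below m")
    H = R.T @ np.linalg.solve(G, R) + P.T @ np.linalg.solve(F, P)
    beta_bar = np.linalg.solve(H, P.T @ np.linalg.solve(F, p))
    return beta_bar, H

# Toy inputs (illustrative assumptions, not values from the text).
m = 5
R = np.zeros((m - 2, m))
for i in range(m - 2):
    R[i, i:i + 3] = [1.0, -2.0, 1.0]          # second differences on a unit grid
G = np.eye(m - 2)                              # stand-in for the smoothness-prior variance
P = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],       # level at the first point
              [-1.0, 1.0, 0.0, 0.0, 0.0]])     # increment between the first two points
F = 0.01 * np.eye(2)
p = np.array([2.0, 0.5])

beta_bar, H = full_prior(R, G, P, F, p)
print(beta_bar)
```

Because a straight line has zero second differences, the prior mean in this toy case is the line with level 2 and slope 0.5, while H is positive definite even though neither component alone has full rank.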

The vector of interest $\omega$ typically includes the function $f(t)$ evaluated at points
$t$ not included in $s$. For example, if information about the posterior distribution
of $f(t)$ on the interval $[\tau_1, \tau_2]$ is presented graphically, then good resolution
requires evaluation of $f(t)$ at 100 or more equally spaced points. Thus the posterior
simulator must provide

$$p(\beta, h, \omega \mid y^o, X, A) \propto p(y^o \mid \beta, h, \omega, X, A)\, p(\beta, h \mid A)\, p(\omega \mid \beta, h, A).$$

Since the distribution of $y$ depends on $\beta$ and $h$ but not $\omega$, $p(y^o \mid \beta, h, \omega, X, A) =
p(y^o \mid \beta, h, X, A)$. Because of the consistency of the smoothness prior with respect
to addition and deletion of points of evaluation, $p(\beta, h \mid A) = \int_\Omega p(\beta, h, \omega \mid A)\, d\omega$
is the same regardless of the composition of $\omega$. Hence

$$p(\beta, h \mid y^o, X, A) \propto p(y^o \mid \beta, h, X, A)\, p(\beta, h \mid A),$$

so the posterior simulator excludes $\omega$ and takes no account of its subsequent
specification. Then

$$p(\omega \mid y^o, X, A) = \iint p(\omega \mid \beta, h, X, A)\, p(\beta, h \mid y^o, X, A)\, d\beta\, dh.$$

The smoothness prior distribution expressed as $R_1\beta + R_2\omega \sim N(0, G)$ provides
the form of $p(\omega \mid \beta, h, X, A) = p(\omega \mid \beta, A)$. Since $\operatorname{rank}(R_2) = \dim(\omega)$,

$$\omega \mid (\beta, A) \sim N[-(R_2'G^{-1}R_2)^{-1}R_2'G^{-1}R_1\beta,\; (R_2'G^{-1}R_2)^{-1}]. \tag{5.43}$$

Replacing $\beta$ with $\beta^{(m)} \sim p(\beta \mid y^o, X, A)$ in this expression provides the simulation
algorithm for $\omega$ given the posterior simulator output.
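
A minimal sketch of this simulation step follows, assuming NumPy and a symmetric positive definite G. The function names `omega_conditional` and `draw_omega` are hypothetical, and the toy case (a single second difference f(0) - 2f(1) + f(2) with variance g, where the midpoint value f(1) plays the role of ω) is an illustration rather than anything taken from the examples.

```python
import numpy as np

def omega_conditional(R1, R2, G, beta):
    """Mean and variance of omega given beta when R1 beta + R2 omega ~ N(0, G)."""
    Ginv_R2 = np.linalg.solve(G, R2)              # G^{-1} R2
    precision = R2.T @ Ginv_R2                    # R2' G^{-1} R2
    # Ginv_R2.T equals R2' G^{-1} because G is symmetric.
    mean = -np.linalg.solve(precision, Ginv_R2.T @ (R1 @ beta))
    return mean, np.linalg.inv(precision)

def draw_omega(R1, R2, G, beta_draws, rng):
    """One draw of omega for each posterior draw of beta (rows of beta_draws)."""
    draws = []
    for beta in beta_draws:
        mean, cov = omega_conditional(R1, R2, G, beta)
        draws.append(rng.multivariate_normal(mean, cov))
    return np.array(draws)

# Toy case: beta = (f(0), f(2)), omega = f(1), one smoothness restriction.
g = 0.2
R1 = np.array([[1.0, 1.0]])
R2 = np.array([[-2.0]])
G = np.array([[g]])

mean, cov = omega_conditional(R1, R2, G, np.array([1.0, 2.0]))
rng = np.random.default_rng(0)
omega = draw_omega(R1, R2, G, np.array([[1.0, 2.0], [0.0, 4.0]]), rng)
```

In the toy case the conditional mean is the average of the two neighboring function values and the conditional variance is g/4, which is what (5.43) gives when R_2 is the scalar -2.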


Example 5.4.1 Nonlinearity in Simple Regression (The online appendix con- 
tains data, annotated code, and output for this example.) Return to Example 5.1.1, 
which examined the impact of class size on test scores using the Massachusetts
Comprehensive Assessment System (MCAS) test data and a normal linear regression
model. In this example we remove the assumption that the regression of test
scores on class size is linear, and replace it with the assumption that the regression
is a smooth function of class size, using (5.29) and the methods described in this 
section. The objectives are to learn about the posterior distribution of the regres- 
sion function, and collect some evidence about the degree of nonlinearity in the 
regression function as indicated by the prior hyperparameter h. Achieving these 
objectives illustrates the construction of a proper prior distribution in a nonlin- 
ear model, and techniques for inferring the posterior distribution of the regression 
function evaluated at points that do not correspond to values in the data. 
To construct a prior distribution, amend (5.29) slightly, by writing

$$y_t = \alpha_1 + \alpha_2 x_t + f(x_t) + \varepsilon_t, \quad \varepsilon_t \overset{iid}{\sim} N(0, h^{-1}) \quad (t = 1, \ldots, T). \tag{5.44}$$

Let $s_1 < \cdots < s_m$ denote the $m$ ordered distinct data points corresponding to
the $m \times 1$ vector $\beta = [f(s_1), \ldots, f(s_m)]'$. Then (5.44) has the form $y = X_1\alpha +
X_2\beta + \varepsilon$, for the suitably arranged $T \times 2$ matrix $X_1$ and $T \times m$ matrix $X_2$. In the
data set from Example 5.1.1, the student : teacher ratio (str) is rounded to the
nearest one-tenth and ranges from 11.4 to 27.0. While $T = 220$, $m = 83$. The two
restrictions

$$\sum_{i=1}^{m} f(s_i) = 0, \quad f(s_1) = f(s_m) \tag{5.45}$$

identify $\alpha_1$, $\alpha_2$, and $f$ in (5.44) without imposing any additional restrictions. [Recall
the discussion of identification in Exercise 4.5.3; see also Poirier (1998).] The
restrictions (5.45) are equivalent to $P\beta = 0$, with an obvious definition of $P$, and
are a limiting case of additional information of the form (5.42), in which $F$ becomes
arbitrarily small. These exact restrictions can be imposed by writing $\beta = Q\beta^*$,
with $Q' = [\, -\tfrac{1}{2}\iota_{m-2} \;\; I_{m-2} \;\; -\tfrac{1}{2}\iota_{m-2} \,]$. The restrictions plus the prior
information $R\beta \sim N(0, G)$ in $y = X_1\alpha + X_2\beta + \varepsilon$ are equivalent to $\beta^* \sim N(0, H_\beta^{-1})$
in $y = X_1\alpha + X_2^*\beta^* + \varepsilon$, where $X_2^* = X_2Q$ and $H_\beta = Q'R'G^{-1}RQ$. If $\alpha \mid A \sim
N(\bar\alpha, H_\alpha^{-1})$, independent of $\beta \mid A$, then as $h \to \infty$ in $G$, the marginal likelihood
of the model must approach that of the linear model $y_t = \alpha_1 + \alpha_2 x_t + \varepsilon_t$ with
the same prior distribution for $\alpha$ and $h$. Proceeding as in Example 5.1.1, we take
$\bar\alpha = (\mu, 0)'$ and $H_\alpha = (5/T\sigma^2)X'X$ using the values of $\mu$ and $\sigma^2$ in that example.
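
The "suitably arranged" matrices can be built mechanically from the covariate. The sketch below is one way to do it, assuming NumPy; the function names are hypothetical, the six data values are made up for illustration, and the form of Q assumes that β* collects the interior values f(s_2), ..., f(s_{m-1}), so that the two restrictions in (5.45) determine the two endpoint values.

```python
import numpy as np

def design_matrices(x):
    """Return the ordered distinct points s, the T x 2 matrix X1, and the T x m matrix X2."""
    x = np.asarray(x, dtype=float).ravel()
    s, idx = np.unique(x, return_inverse=True)   # sorted distinct values; idx[t] locates x[t] in s
    T, m = x.size, s.size
    X1 = np.column_stack([np.ones(T), x])        # intercept and linear term
    X2 = np.zeros((T, m))
    X2[np.arange(T), idx] = 1.0                  # row t selects f(x_t) from beta
    return s, X1, X2

def restriction_matrix(m):
    """m x (m-2) matrix Q with beta = Q beta*, imposing sum f(s_i) = 0 and f(s_1) = f(s_m)."""
    half = -0.5 * np.ones((1, m - 2))
    return np.vstack([half, np.eye(m - 2), half])

# Hypothetical data: six observations taking four distinct values.
x = [11.4, 12.0, 11.4, 27.0, 12.0, 15.5]
s, X1, X2 = design_matrices(x)
Q = restriction_matrix(s.size)
X2_star = X2 @ Q                                 # regressors for beta* in the restricted model
```

Each row of X2 contains a single 1, so X2 β simply looks up the function value at that observation's covariate, and the columns of Q sum to zero with equal first and last entries, which is exactly what (5.45) requires.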

The model thus formulated is a special case of the normal linear regression model 
of Example 2.1.2, and therefore BACC may be applied. As noted in Example 4.3.1, 
the Gibbs sampling algorithm for the normal linear model produces drawings from 
the posterior distribution that are very nearly serially uncorrelated. There is little or 
no need for burn-in iterations, and only a few iterations are needed to draw values 
from the posterior distribution of the regression functions. Marginal likelihood can 
be computed exactly as described in Example 5.1.1. It is only for computing the 
posterior mean of the regression, displayed as a heavy line in Figure 5.1, that a 
larger number of iterations is needed, and in this example we use 100.

Figure 5.1. Posterior mean of regression of test score on student : teacher ratio (heavy line), and three
drawings from the posterior distribution (three light lines) for each of four smoothness hyperparameters
$h^{-1/2}$: smoothness parameters (a) 0.50; (b) 1.00; (c) 2.00; (d) 5.00.


Recall from (5.34) that

$$\operatorname{var}\{[f(x + 1) - f(x)] - [f(x) - f(x - 1)]\} = \frac{2}{3h}.$$

Alternative values of the smoothness parameter $h^{-1/2}$ produce the following log
marginal likelihoods:

$h^{-1/2}$          Log Marginal Likelihood
0 (linear model)    -913.6910
0.5                 -912.4573
1                   -911.5170
2                   -911.6266
5                   -913.5087

The model with $h^{-1/2} = 1$ has the highest marginal likelihood. However, differences
in these models are not great; the Bayes factor in favor of this model, as
opposed to the linear model, is 8.15.

The regression functions can be plotted using the 157 values in increments of 
one-tenth over the data range in the student : teacher ratio. Since not all of these 
values correspond to data points, values for the non-data points are drawn using
(5.43). Adding more points to the plot is straightforward and increases computation 
time only negligibly. 

Increasing values of $h^{-1/2}$ permit regression functions that are increasingly
rough, while as $h^{-1/2} \to 0$, the function becomes linear. Note in Figure 5.1 that
the posterior mean of the regression function (solid line) appears smoother than the 
drawings from the posterior distribution (lighter lines) in every case. That is because 
visual smoothness is inversely related to the absolute value of second differences 
of a function; Jensen’s inequality accounts for the difference. It is important to 
keep in mind that while the visual appearances of these curves differ, the evidence 
that discriminates between them is rather weak as indicated by the Bayes factors 
implicit in the log marginal likelihoods. 
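
The Jensen's inequality argument can be checked with a few lines of code. The sketch below is purely illustrative: the "posterior draws" are simulated as a linear trend plus independent noise (an assumption made only for this demonstration, not the posterior of the example), and `roughness` is a hypothetical name for the mean absolute second difference on an equally spaced grid.

```python
import random

def roughness(f):
    """Mean absolute second difference of function values on an equally spaced grid."""
    d2 = [f[i + 1] - 2.0 * f[i] + f[i - 1] for i in range(1, len(f) - 1)]
    return sum(abs(v) for v in d2) / len(d2)

random.seed(0)
n_grid, n_draws = 50, 100

# Hypothetical draws: a common linear trend plus independent noise at each grid point.
draws = [[-0.5 * i + random.gauss(0.0, 1.0) for i in range(n_grid)]
         for _ in range(n_draws)]

# Pointwise average of the draws, playing the role of the posterior mean curve.
mean_curve = [sum(d[i] for d in draws) / n_draws for i in range(n_grid)]

rough_of_mean = roughness(mean_curve)                       # roughness of the averaged curve
mean_of_rough = sum(roughness(d) for d in draws) / n_draws  # average roughness of the draws
print(rough_of_mean, mean_of_rough)
```

Since the absolute value is convex, the roughness of the averaged curve can never exceed the average roughness of the individual curves, whatever the distribution of the draws.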




In Example 5.1.1 other covariates were found to be important in accounting 
for the average test score in each district: the percentage of students still learning 
English, the percentage of students receiving a subsidized lunch, and the log of 
average district income. Example 5.4.1 omitted these covariates. Reintroducing 
them in linear fashion poses no essential complications. 


Example 5.4.2 Nonlinearity in Multiple Regression (The online appendix con- 
tains data, annotated code, and output for this example.) Maintain all the assump- 
tions of Example 5.1.1, except that now the student : teacher ratio enters the model 
in nonlinear fashion. Then we may express the model as an extension of (5.44): 
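$$y_t = \alpha_1 + \alpha_2 x_t + f(x_t) + \sum_{j=1}^{n} \gamma_j z_{jt} + \varepsilon_t, \quad \varepsilon_t \overset{iid}{\sim} N(0, h^{-1}) \quad (t = 1, \ldots, T). \tag{5.46}$$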
In this application n = 3 and j indexes the three other covariates. More gen- 
erally, (5.46) is sometimes called a semiparametric model, since it mixes the 
function f(x;) (“nonparametric”) with the other linear components in the model. 
The Bayesian approach developed in this section renders this distinction inessen- 
tial, since we may select from among the continuum of unobservables any finite 
number of functions of interest and carry out inference in a fashion that is logically 
consistent over all possible choices of these functions of interest. 
We proceed in the same fashion as in the previous example. The prior distribution
of the vector $(\alpha_1, \alpha_2, \gamma_1, \gamma_2, \gamma_3)$ is the same as the prior distribution of $\beta$ in
Example 5.1.1, and the alternative values of the smoothing hyperparameter $h$ are
the same as those in the previous example. The model comparison exercise produces
the following findings:

$h^{-1/2}$          Log Marginal Likelihood
0 (linear model)    -805.8189
0.5                 -806.2772
1                   -807.0044
2                   -808.2665
5                   -811.1137


The original linear model is the most favored of the five specifications, but the 
evidence in favor of linearity is slightly weaker than the evidence in favor of non- 
linearity in the previous example. The Bayes factor in favor of the linear model, 
versus the nonlinear specification with $h^{-1/2} = 1$, is 3.27. On the other hand, the
evidence that the three additional covariates should be included is overwhelm- 
ing, as indicated by comparison with the log marginal likelihoods in the previous 
example. 
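
The Bayes factor quoted above follows directly from the tabulated log marginal likelihoods of this example: it is the exponential of their difference. A short sketch in plain Python, with hypothetical names `log_ml` and `bayes_factor`, and with dictionary keys equal to the tabulated values of the smoothness parameter (0 denoting the linear model):

```python
import math

# Log marginal likelihoods from the table in Example 5.4.2.
log_ml = {0.0: -805.8189, 0.5: -806.2772, 1.0: -807.0044, 2.0: -808.2665, 5.0: -811.1137}

def bayes_factor(log_ml_a, log_ml_b):
    """Bayes factor in favor of model a over model b, from log marginal likelihoods."""
    return math.exp(log_ml_a - log_ml_b)

bf_linear_vs_1 = bayes_factor(log_ml[0.0], log_ml[1.0])
print(round(bf_linear_vs_1, 2))   # 3.27
```

Working with differences of logs before exponentiating avoids underflow, since the marginal likelihoods themselves are far too small to represent in floating point.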

Figure 5.2 provides some aspects of the posterior distribution of the regression 
function, evaluated at the student : teacher ratio values indicated on the horizontal 
axes, and sample median values of all of the other covariates. Whereas a student : teacher
ratio of 25 as opposed to 15 accounted for a 20 to 30 point difference
in test scores in the absence of the other covariates, here it accounts for somewhat 
less than 10 points, which in turn is roughly consistent with the posterior mean of 
—0.68 for the student : teacher ratio coefficient in Example 5.1.1. The predomi- 
nant characteristic of the nonlinearity of the regression function in Example 5.4.1 
was a kink at a student : teacher ratio of ~17, a feature that is exhibited weakly 
in Figure 5.2. Moreover, the uncertainty about departures from linearity clearly
overwhelms any such systematic tendency, as indicated by the lighter lines in the
figure. This is consistent with the Bayes factor in favor of the linear specification 
as opposed to any one of the nonlinear specifications. 

The nonlinear specification in student : teacher ratio has a negligible impact
on the posterior distribution of the coefficients of the other covariates.
Comparison with the results in Example 5.1.1 shows the following:

                     Student : Teacher Ratio Functional Form
                     Linear                      Nonlinear
Coefficient          Posterior Mean    SD        Posterior Mean    SD
Learning English     -0.4086           0.2788    -0.3990           0.2811
Subsidized lunch     -0.5173           0.0678    -0.5117           0.0673
Log income           16.418            2.964     17.005            2.902

Figure 5.2. Posterior mean of regression function for test score nonlinear in student : teacher ratio
(heavy line) and linear in other covariates, and three drawings from the posterior distribution (three
light lines) of the nonlinear student : teacher portion for each of four smoothness hyperparameters
$h^{-1/2}$: smoothness parameters (a) 0.50; (b) 1.00; (c) 2.00; (d) 5.00.


5.4.2 Nonlinear Regression with Basis Functions 


A sequence of normal linear models $A_j$ $(j = 1, 2, \ldots)$ captures the essentials of
nonlinear regression with basis functions. In model $A_j$, we obtain

$$y_t = f_j(x_t) + \varepsilon_t = \sum_{i=1}^{k_j} \beta_{ji}\phi_{ji}(x_t) + \varepsilon_t = \beta_j'\phi_j(x_t) + \varepsilon_t, \quad \varepsilon_t \overset{iid}{\sim} N(0, h^{-1}). \tag{5.47}$$

As $j$ increases so does $k_j$, and the function $f_j$ becomes, loosely speaking, more
flexible. For example, if $x_t$ is $2 \times 1$ and the basis functions are monomials, then a
sequence might be

$$\begin{aligned}
&\phi_{j1}(x) = 1 && \text{for all } j = 1, 2, \ldots \\
&\phi_{j2}(x) = x_1,\ \phi_{j3}(x) = x_2 && \text{for all } j = 2, 3, \ldots \\
&\phi_{j4}(x) = x_1^2,\ \phi_{j5}(x) = x_1x_2,\ \phi_{j6}(x) = x_2^2 && \text{for all } j = 3, 4, \ldots \\
&\phi_{j7}(x) = x_1^3,\ \phi_{j8}(x) = x_1^2x_2,\ \phi_{j9}(x) = x_1x_2^2,\ \phi_{j10}(x) = x_2^3 && \text{for all } j = 4, 5, \ldots
\end{aligned} \tag{5.48}$$

and so on. For this sequence $k_j = j(j + 1)/2$. By the
Weierstrass theorem, there exists a sequence of such basis functions $\{\phi_{ji}\}$ such that
$\lim_{j \to \infty} \sup_{x \in C} |f(x) - f_j(x)| = 0$ for any compact set $C \subset \mathbb{R}^2$. Other systems of
basis functions that have proved useful in econometric applications include Fourier
sequences (Gallant 1981) and Müntz-Szász expansions (Barnett and Jonas 1983).
In all of these systems, the functions $\phi_{ji}$ are chosen so that $f(x)$ is forced to be
smooth when j is small, and the smoothness assumption is weakened steadily as 
j increases. Precisely what is meant by “smooth” and “weakened” depends on the 
particular system of basis functions. 
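
The bivariate monomial sequence in (5.48) is easy to enumerate. The sketch below, in plain Python with hypothetical function names, lists the exponent pairs in the same order as (5.48) and confirms the count k_j = j(j + 1)/2 for the first few orders.

```python
def monomial_exponents(j):
    """Exponent pairs (a, b) for the terms x1^a * x2^b of model A_j, ordered as in (5.48).

    Model A_j contains every monomial of total degree 0, 1, ..., j - 1.
    """
    return [(d - i, i) for d in range(j) for i in range(d + 1)]

def basis_row(x1, x2, j):
    """Evaluate the k_j basis functions of model A_j at the point (x1, x2)."""
    return [x1 ** a * x2 ** b for a, b in monomial_exponents(j)]

# Check the count k_j = j(j + 1)/2 for the first four orders.
for j in range(1, 5):
    assert len(monomial_exponents(j)) == j * (j + 1) // 2

print(basis_row(2.0, 3.0, 3))   # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```

Stacking `basis_row` over the T observations gives the regressor matrix of the normal linear model A_j, so moving from one order to the next only appends columns.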

The approaches to nonlinear regression in (5.29) and (5.47) are complementary. 
Basis functions can be applied when the domain of f is multidimensional in much 
the same way as when it is unidimensional, especially for monomial basis functions. 
On the other hand, the ordering of the covariates $x_t$ exploited in developing the
approach in Section 5.4.1 cannot be extended to multidimensional $x_t$. For any
given order of expansion j, basis functions force the function f to be smooth 
no matter how strong the evidence to the contrary in the data, whereas for a 
given smoothness prior hyperparameter h in Section 5.4.1, a sufficiently strong 
departure from smoothness in the data can place substantial posterior probability 
on regression functions that are not smooth. This latter contrast is mitigated, to 
some extent, by the fact that each entails a portfolio of models, indexed by h in 
the case of smoothness priors and by j in the case of basis functions. In each case 
the models may be compared or averaged as described in Section 2.6.1. The less 
smooth is the function f, the smaller will be the hyperparameter h favored by 
Bayes factors in the former case and the larger will be the order of expansion j 
favored in the latter. 

In formulating prior distributions of the coefficient vector B in nonlinear regres- 
sion with basis functions, it is useful to think in terms of the function f. This is 
especially important in comparing variants with different numbers of basis functions,
because it ensures comparable priors. For example, the prior distribution
consisting of the components $f(a_i) \overset{iid}{\sim} N(\mu, \tau^2)$ $(i = 1, \ldots, n)$ implies the prior
distribution consisting of the components

$$\beta_j'\phi_j(a_i) \mid A_j \overset{iid}{\sim} N(\mu, \tau^2) \quad (i = 1, \ldots, n)$$

when the order of expansion is $j$. If $J$ is the highest order of expansion considered
and $n \ge k_J$ points are chosen appropriately, this approach will provide comparable
and proper prior distributions for the coefficients in all orders of expansion. This 
approach is illustrated in the following example. 


Example 5.4.3 Basis Functions for a Single Covariate (The online appendix 
contains annotated code and output for this example.) In the nonlinear model (5.29),
nonlinearity in the function of one variable $f(\cdot)$ may be captured by using a
sequence of basis functions with a single argument. This example uses polynomials
of increasing order:

$$y_t = \sum_{i=0}^{j} \beta_i x_t^i + \varepsilon_t, \quad \varepsilon_t \overset{iid}{\sim} N(0, h^{-1}) \quad (t = 1, \ldots, T) \tag{5.49}$$

in conjunction with the sequence of conditionally conjugate prior distributions

$$H = (k_j/T\sigma^2)X'X, \quad \bar\beta = (X'X)^{-1}X'\iota_T\mu = (\mu, 0, \ldots, 0)', \tag{5.50}$$

and the same values of $\sigma^2$ and $\mu$ used in the previous examples. In (5.50) $X$ is the
covariate matrix in (5.49) and has $k_j = j + 1$ columns.
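
The prior in (5.50) can be assembled as follows. This is a sketch under stated assumptions, not the BACC implementation: it uses NumPy, a hypothetical grid of covariate values, and made-up values of μ and σ² (the values used in the book's examples are not restated in this section). It also confirms numerically that projecting the constant vector on a polynomial design that includes an intercept gives the prior mean (μ, 0, ..., 0)'.

```python
import numpy as np

def polynomial_prior(x, j, mu, sigma2):
    """Design matrix, prior precision H, and prior mean beta_bar for a polynomial of order j."""
    x = np.asarray(x, dtype=float)
    T = x.size
    X = np.vander(x, N=j + 1, increasing=True)    # columns 1, x, ..., x^j, so k_j = j + 1
    k_j = X.shape[1]
    H = (k_j / (T * sigma2)) * (X.T @ X)
    # Least squares gives (X'X)^{-1} X' (mu * iota_T) without forming the inverse explicitly.
    beta_bar, *_ = np.linalg.lstsq(X, mu * np.ones(T), rcond=None)
    return X, H, beta_bar

# Hypothetical inputs for illustration only.
x = np.linspace(11.4, 27.0, 50)
X, H, beta_bar = polynomial_prior(x, j=2, mu=700.0, sigma2=400.0)
print(beta_bar)
```

For higher polynomial orders the columns of X become nearly collinear on this range, so in practice one might center and scale the covariate before forming powers; that rescaling is a numerical convenience and not something the text prescribes.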

This model may be applied using BACC, much as in Example 5.4.1. The 
marginal likelihoods are 


J Log Marginal Likelihood 
1 (linear model) —913.6910 
2 —911.0031 
3 —912.6455 
4 —914.1524 
5 —915.3689 


The results are similar to those in the nonlinear model of Example 5.4.1, in that 
there is moderate evidence in favor of nonlinearity, with nonlinearity in this case 
represented by low-order polynomials. Note that the marginal likelihood of the most
favored model, a polynomial of order 2, is very nearly the same as that of the most favored
model in Example 5.4.1, which has a smoothing hyperparameter of $h^{-1/2} = 1$. The
posterior distribution of the regression function reveals both further similarities and 
important contrasts with the approach taken in that earlier example. 

In general the posterior means in Figures 5.1 and 5.3 show functions of simi- 
lar global orientation and shape, and appear more irregular as more flexibility is 
allowed. The polynomial basis functions exhibit stronger curvature with less flexi- 
ble models, and the smoothness prior shows a kink near the student : teacher ratio 
value of 17 with less flexible models. The most flexible models, in panel (d) of 
each figure, exhibit posterior means of regression functions that are almost iden- 
tical. The three drawings from the posterior distribution of regression functions 
in each panel illustrate the fact that the functions themselves are quite different. 
In the case of basis functions (Figure 5.3) the second derivative at a point is a 
deterministic function of second derivatives globally, and therefore a function of 
information in the data as well as the prior, whereas in the case of smoothness 
priors (Figure 5.1) it is a function only of the prior distribution. This property of
posterior distributions of regression functions with smoothness priors is obscured
in the posterior mean.

Figure 5.3. Posterior mean of regression of test score on student : teacher ratio (heavy line), and three
drawings from the posterior distribution (three light lines) for polynomials of different order: polynomial
orders (a) 2; (b) 3; (c) 4; (d) 5.

Introducing the other covariates, as in Example 5.4.2, produces similar compar- 
isons and contrasts. The marginal likelihoods are 


J Log Marginal Likelihood 
1 (linear model) —805.8189 
2 —807.7776 
3 —809.8334 
4 —810.9572 
5 —812.7080 


The linear model is, again, favored over any of the alternatives that permit a
nonlinear, but separable, effect of the student : teacher ratio on test scores; once
again, the evidence is not strong. The posterior means of the regression as a function
of the student : teacher ratio, using this polynomial expansion, bear a striking
resemblance to those using the smoothness priors, as may be seen by comparing
Figure 5.4 with Figure 5.2.

Figure 5.4. Posterior mean of regression function for test score polynomial in student : teacher ratio
(heavy line) and linear in other covariates, and three drawings from the posterior distribution (three
light lines) of the nonlinear student : teacher portion for different orders of the polynomial: polynomial
orders (a) 2; (b) 3; (c) 4; (d) 5.


An attraction of nonlinear regression with basis functions is that the nonlinear 
component of the regression can be a vector, as well as a scalar, as indicated in the 
illustration (5.48) for polynomial basis functions and a vector of dimension 2. For a 
vector of dimension $k$ and a polynomial expansion of order $J$, the number of terms
in the expansion is of the order $k^J$. For a given sample size, the larger is $k$, the
lower is the order of expansion that will be reasonable, appraised (appropriately) 
by marginal likelihoods for the different expansions. In the case of the smoothness 
priors developed in Section 5.4.1, however, treatment of the vector case is impos- 
sible. That is because the approach taken there relies critically on the ordering of 
the covariates, and vectors of dimension greater than one cannot be ordered.

Example 5.4.4 Basis Functions for a Pair of Covariates (The online appendix
contains annotated code and output for this example.) In the context of the test
score example with all covariates, consider the polynomial expansion (5.48) using
the covariates student : teacher ratio (str) and log income (income). Continuing
to denote the full matrix of covariates by X, the prior distribution is (5.50). Consider
a few orders of expansion, presented in the following table along with the
corresponding marginal likelihood values:

Terms Included                                          Log Marginal Likelihood
str, income                                             -805.8189
Above plus str · income                                 -805.6166
Above plus str^2, income^2                              -808.9928
Above plus str^(3-i) · income^i (i = 0, 1, 2, 3)        -814.0507


Bayes factors favor the model with the single interaction term: the values are 1.22 
against the linear model, and 29.26 against the model with quadratic terms. 

As always, it is important to examine the implications of alternative models, 
including those with low posterior probabilities, if it is thought that the inclusion 
of these models in the analysis may affect the decision at hand. Using BACC 
and mathematical applications software, we can produce the representations of the 
posterior mean of the regression function for the model with the single interaction 
term, and the one that adds the two quadratic terms, shown in Figure 5.5. In 
both cases, the relationship between the student : teacher ratio and the district 
average test score depends strongly on average income in the district. The lower the 
average income in a school district, the more rapidly the expected average test score 
drops as student : teacher ratio increases. As incomes increase, this relationship is 
attenuated, and for log per capita incomes above about 3.05 [exp(3.05) × $1000 =
$21,115, about one-quarter of the sample], it is in fact reversed, so that expected test 
scores increase as the student : teacher ratio increases. In the context of decision 
problems like the ones taken up in Example 5.1.2 and Exercise 5.1.5, this is a 
serious complication. Exercise 5.4.5 pursues this issue. 

Figure 5.5 conveys the posterior expectation of the regression function, but no 
other aspects of its distribution. Interactive displays can convey useful descriptions 
of uncertainty about a function; code for some displays may be found in the online 
appendix for this example. In some decision problems, like the one stemming from 
the loss function (5.5) in Example 5.1.2, uncertainty doesn’t matter, but in others, 
such as the one stemming from the loss function (5.6) in the same example, it does. 
For the models studied in Figure 5.5 it is easy to find the posterior probability that 
the regression function is monotone decreasing in student : teacher ratio, for all 
values of log income and student : teacher ratio included in the grid; it is .065 in
the case of the model with a single interaction term and .031 with the model that
adds the quadratic terms.

Figure 5.5. Contour plots of the posterior mean of the regression function, evaluated at combinations
of two covariates as indicated, and sample median values of all other covariates: (a) linear terms plus
str-logincome interaction; (b) linear terms plus polynomials to order 2.
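
A posterior probability of monotonicity like the ones reported in Example 5.4.4 is the fraction of posterior draws whose regression surface is decreasing in the student : teacher ratio at every grid point. The sketch below shows the calculation in plain Python. The function names, the grid, and the four coefficient "draws" are all hypothetical; in an actual application the draws would come from the posterior simulator.

```python
def prob_monotone_decreasing(draws):
    """Share of draws whose surface decreases along every row.

    draws[m][i][k] is the m-th draw of the regression function at income grid
    point i and student : teacher grid point k (the latter in ascending order).
    """
    def decreasing(surface):
        return all(row[k + 1] < row[k]
                   for row in surface for k in range(len(row) - 1))
    return sum(decreasing(s) for s in draws) / len(draws)

def interaction_surface(g_str, g_inc, g_int, str_grid, inc_grid):
    """Regression surface with linear terms and a single str * income interaction."""
    return [[g_str * s + g_inc * v + g_int * s * v for s in str_grid]
            for v in inc_grid]

# Hypothetical grid and coefficient draws, for illustration only.
str_grid = [15.0, 20.0, 25.0]
inc_grid = [2.5, 3.0, 3.5]
coef_draws = [(-3.0, 10.0, 0.5), (-3.0, 10.0, 1.0), (-2.0, 10.0, 0.2), (-4.0, 12.0, 1.3)]

draws = [interaction_surface(a, b, c, str_grid, inc_grid) for a, b, c in coef_draws]
prob = prob_monotone_decreasing(draws)
print(prob)
```

With an interaction term the slope in the student : teacher ratio is g_str + g_int × income, so a draw fails the test as soon as that slope turns positive at the highest income on the grid.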


Exercise 5.4.1 Completing the Argument Derive (5.34), (5.35), and (5.36). You 
may find a symbolic processor, like Maple or Mathematica, helpful. 


Exercise 5.4.2 Credible Sets In non-Bayesian nonparametric regression it is 
common to find an approximate confidence interval for f(x) for each of many 
closely spaced values of x over a given range. In the Bayesian nonlinear model of 
Section 5.4.1 we can find an exact credible interval for f(x) for these same values. 
In each case we may plot the intervals as a function of x, and the resulting picture 
provides a representation of uncertainty about the function. 


(a) Choose either Example 5.4.1 or 5.4.2, and plot 80% credible intervals over 
the range of student : teacher ratio in the data, 11.4 to 27.


(b) Find the posterior probability that f(x) is contained in the intervals found 
in (a), for all x in the range 11.4 to 27.

(c) Find a constant $c$ such that

$$P\{f(x) \in [\bar f(x) - c,\ \bar f(x) + c]\ \ \forall\, x \in [11.4, 27]\} = 0.8,$$

where

$$\bar f(x) = E[f(x) \mid y^o, X, A].$$

Plot the functions $\bar f(x) - c$, $\bar f(x)$, and $\bar f(x) + c$ on the same graph with
the credible intervals found in (a). Compare and discuss.


Exercise 5.4.3 Inference about Shape The posterior means of the regression 
functions in Figures 5.1 through 5.4 are monotone decreasing and concave over 
the range shown in many cases. 


(a) Show that the posterior probability that the regression function has this shape 
is zero in all cases, in Examples 5.4.1 and 5.4.2. 

(b) Find the posterior probability that the regression function has this shape 
for polynomials of order 2, 3, 4, and 5 in Examples 5.4.3 and 5.4.4. (For
continuation, see Exercise 8.2.5.)


Exercise 5.4.4 Model Combination Figures 5.1 and 5.2, together with the 
marginal likelihoods that accompany these models, indicate substantial uncertainty 
over models, and perhaps about shape as well. Select the model corresponding to 
one of the figures for this exercise. 


(a) Assign equal prior probability to the linear model and each of the four non- 
linear models. Conditional on this specification, find and plot the posterior 
expectation of the regression function as a heavy line, and draws from the 
posterior distribution as lighter lines, in the same way as done in the figures. 


(b) An alternative approach to uncertainty about the smoothing hyperparameter 
h is to embed it in a hierarchical prior. Construct such a prior, and then 
find the posterior distribution of the regression function as well as $h^{-1/2}$.
Compare the result with those for the examples in Section 5.4.1. 


Exercise 5.4.5 Decisionmaking in the Nonlinear Regression Model Example 
5.1.2 introduced a decision problem about the student : teacher ratio using the 
alternative loss functions (5.5) and (5.6). That example provided solutions of those 
problems assuming a linear regression function. 


(a) Using the smoothing approach of Example 5.4.2, or the polynomial approach 
of Example 5.4.3 with the full set of covariates, solve the same decision 
problems, for the same combinations of c and d considered in Example 5.1.2.
Compare your answers with those in that example.

(b) Now consider the same problem using the polynomial in student : teacher
ratio and log income of Example 5.4.4. Recall the complication implied 
by the results in Figure 5.5 for school districts with per capita incomes in 
the top quartile of the sample, implying that these districts would choose 
very high student : teacher ratios. You might address this problem by model 
averaging, by imposing shape constraints, or in some other way. After doing 
so, find the optimal student : teacher ratio for the same combinations of c 
and d considered in Example 5.1.2, and some alternative, representative
values of income. 


Exercise 5.4.6 Specification of the Earnings Regression Function Example 
6.4.1 introduces the regression of log earnings on age and education, emphasiz- 
ing the nonnormality of the distribution of the regression residuals. That example 
assumes that the regression function of log earnings on age and education is an 
interactive polynomial of order 4 in age and 2 in education. In answering the fol- 
lowing questions, assume that the distribution of the regression residual is i.i.d. 
normal. 


(a) What is the evidence in the data on the adequacy of this choice? Do marginal 
likelihoods for different orders of polynomial expansion indicate a clear- 
cut choice, or do they suggest model averaging using several alternative 
specifications? 

(b) From the models investigated in (a), select the three with the three highest 
marginal likelihoods. Is there much difference among the corresponding 
functions $E(y \mid a, e, y^o, A)$ as indicated, for example, by representations
like those in Figure 5.5? 


CHAPTER 6 


Modeling with Latent Variables 


Latent, or unobserved, variables are often important components in econometric 
and statistical models. They occur in these models for a variety of reasons. 


1. The model may pertain to a substantial number of heterogeneous entities, each 
with its own set of parameters. It is natural to regard these parameters as latent 
variables having a distribution that, in turn, is characterized by unknown 
parameters. (See Example 3.1.1.) Section 3.1 showed that this formulation 
is equivalent to a hierarchical prior distribution for the parameters of the 
heterogeneous entities. The techniques described in this chapter have been 
applied with noted success in this context, particularly in marketing; see, for 
example, Ainslee and Rossi (1998), Allenby and Rossi (1999), and Kim et al. 
(2003), as well as Rossi and Allenby (2003) for an overview. 


2. If data are missing because of complications in data collection, it is natural to 
regard the missing data as latent. The original model must be supplemented 
with one for the process by which data are recorded or not. The latter model 
turns out to be simple if data are missing at random (Example 2.2.3). [For 
specific examples, see Exercises 5.2.1 and 5.3.3(d).] 

3. Outcomes may not be fully observed, because of the way data are recorded, 
or because observable behavior reveals only certain characteristics of the 
data. It is natural to regard the outcomes as latent variables. The complete 
model has two components: the first linking the incompletely observed out- 
comes to parameters or other unobservables in the usual way, and the second 
linking data to incompletely observed outcomes. Section 6.1 provides a gen- 
eral approach for this setting, and Section 6.2 applies it to the modeling of 
discrete outcomes. 

4. In a mixture model the distribution of a random variable depends on a latent 
state. The state can be either continuous or discrete, and in either case a 
supplementary model describes the distribution of the state. Mixture models
provide versatile tools for extending simple distributions, such as the normal,
to classes that are richer, more flexible, and more realistic. Section 6.4
enriches the normal linear model in this way. 


A common characteristic of all these models is a hierarchy, or layering, in which 
the distribution of latent variables is accounted for by a stated set of assumptions 
and unobservables (parameters), and the distribution of observables is then driven 
by the latent variables together with some combination of the same and additional 
unobservables. (Exercise 3.1.2 developed the essence of this structure.) This hier- 
archy leads to significant simplification in the posterior distribution that can be 
exploited by the Gibbs sampler and related posterior simulators, as illustrated in 
Exercises 5.2.1 and 5.3.3(d). This simplification plays a critical role in the models 
described in this chapter as well. 


6.1 CENSORED NORMAL LINEAR MODELS 


An important source of earnings data in the United States is the records kept by the 
Social Security Administration. These records track individuals’ annual earnings 
subject to social security tax. The tax is applied to all labor earnings, up to a limit 
that changes from year to year. Thus if $t$ indexes an individual in a particular year,
a known upper limit $y_t^*$ applies to social security earnings $y_t$ for observation $t$. If
we denote actual earnings by $\tilde y_t$, then

$$y_t = \begin{cases} \tilde y_t & \text{if } \tilde y_t < y_t^* \\ y_t^* & \text{if } \tilde y_t \ge y_t^*. \end{cases}$$

This is an example of censoring of a measurement. In this case values that exceed 
a known threshold are replaced by the threshold, a process often called “right 
censoring.” It is important that the threshold be known. Censored measurement of 
the outcome variable in the normal linear model provides an important special case 
of the censored normal linear model. 

As a second motivating example, suppose that T households indexed by t have 
preferences over a specific good x and all other goods z given by the utility function 


$$U_t(x, z) = \log(a_t + x) + b\log(z).$$

The unobservable $a_t$ is specific to household $t$. Only nonnegative amounts of $x$
and $z$ may be consumed. Clearly, each household $t$ must consume some positive
amount $z_t$ of $z$. If $a_t < 0$, then household $t$ must consume more than $-a_t$ units of
$x$, whereas if $a_t > 0$, household $t$ may choose not to consume $x$ at all. Suppose
that in each household $t$ the sum of expenditures on $x$ and $z$ is fixed, and that these
goods have prices $p_{xt}$ and $p_{zt}$, respectively. If household $t$ consumes a positive
amount $x_t$ of $x$, then $(\partial U_t/\partial x_t)/(\partial U_t/\partial z_t) = p_{xt}/p_{zt}$, and in this case

$$z_t/[b(a_t + x_t)] = p_{xt}/p_{zt} \implies x_t = z_tp_{zt}/bp_{xt} - a_t.$$

If $a_t > z_tp_{zt}/bp_{xt}$, then household $t$ does not consume $x$. If the unobservable $a_t$ is
independently and identically distributed across households, $a_t \sim N(\mu, \sigma^2)$, then

$$x_t = \max[-\mu + b^{-1}(p_{zt}z_t)/p_{xt} + \varepsilon_t,\ 0], \quad \varepsilon_t \overset{iid}{\sim} N(0, \sigma^2) \quad (t = 1, \ldots, T).$$


Models of this kind have been important in marketing and the success of methods 
like the ones described in this section; see, for example, Kim et al. (2003). 

Note that the second example produces a left-censored outcome, whereas the 
first produces a right-censored outcome. In the second example the censoring points 
are all the same (zero), whereas in the first they vary (\(y_t^*\)) but are known. In order to 
handle both of these examples as special cases of a more general model, this section 
develops a general version of the censored normal linear model. This treatment is 
extended to nonnormal censored linear models in Section 6.4.3. BACC supports 
censored linear models at the level described in this chapter. 

A censored normal linear model begins with 


\[\underset{T \times 1}{\tilde{y}} = \underset{T \times k}{X}\beta + \varepsilon, \quad \varepsilon \sim N(0, h^{-1} I_T), \tag{6.1}\]

in which \(\tilde{y} = (\tilde{y}_1, \ldots, \tilde{y}_T)'\) is a vector of latent variables. There is a known, 
set-valued function 

\[C_t = c_t(\tilde{y}_t) \tag{6.2}\]

mapping each possible outcome \(\tilde{y}_t\) into exactly one set \(C_t\). The observable set 
\(C_t\) contains the latent variable \(\tilde{y}_t\). It may be a (half) open or (half) closed interval. 
Important special cases are intervals including \(-\infty\) or \(\infty\) as endpoints, and 
singletons. Denote the collection \(C = (C_1, \ldots, C_T)\). 

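In the first motivating example \(\tilde{y}_t\) is actual earnings. If \(\tilde{y}_t < y_t^*\), then \(C_t\) is the 
singleton \(\{\tilde{y}_t\}\); otherwise \(C_t = [y_t^*, \infty)\). In the second motivating example 

\[\tilde{y}_t = -\mu + b^{-1}(p_{zt} z_t)/p_{xt} + \varepsilon_t, \quad \varepsilon_t \sim N(0, \sigma^2) \quad (t = 1, \ldots, T).\]

If \(\tilde{y}_t > 0\), then \(C_t = \tilde{y}_t\); otherwise \(C_t = (-\infty, 0]\). 

In the conditionally conjugate prior distribution \(\beta\) and \(h\) are independently 
distributed: 

\[\beta \mid A \sim N(\underline{\beta}, \underline{H}^{-1}), \quad \underline{s}^2 h \mid A \sim \chi^2(\underline{\nu}). \tag{6.3}\]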


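Utilizing (6.1), (6.2), and (6.3), the joint distribution of observables and 
unobservables in the censored linear model is 

\[p(\beta, h, \tilde{y}, C \mid X, A) = p(\beta \mid A)\, p(h \mid A)\, p(\tilde{y} \mid \beta, h, X, A)\, p(C \mid \tilde{y}, A).\]

From (6.2) 

\[p(C \mid \tilde{y}, A) = \prod_{t=1}^T p(C_t \mid \tilde{y}_t, A) = \prod_{t=1}^T I_{C_t}(\tilde{y}_t). \tag{6.4}\]

[Observe that \(p(C_t \mid \tilde{y}_t, A)\) in (6.4) is a probability function and not a probability 
density function; it places all its mass on a single observable outcome set indexed 
by \(\tilde{y}_t\).] Hence \(p(\beta, h, \tilde{y}, C \mid X, A) \propto\) 

\[\exp[-(\beta - \underline{\beta})'\underline{H}(\beta - \underline{\beta})/2]\, h^{(\underline{\nu}-2)/2} \exp(-\underline{s}^2 h/2) \tag{6.5a}\]
\[\cdot\ h^{T/2} \exp[-h(\tilde{y} - X\beta)'(\tilde{y} - X\beta)/2] \tag{6.5b}\]
\[\cdot\ \prod_{t=1}^T I_{C_t}(\tilde{y}_t). \tag{6.5c}\]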


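The kernel of (6.5a)–(6.5c) in \((\beta, h, \tilde{y})\), with the observed outcome \(C^o\) replacing 
the observable \(C\), is the kernel of the posterior distribution. The conditional 
posterior distributions are simple. The kernels of \(p(\beta \mid h, \tilde{y}, y^o)\) and \(p(h \mid \beta, \tilde{y}, y^o)\) 
are determined by (6.5a)–(6.5b) alone, from which 

\[\beta \mid (h, \tilde{y}, y^o, X, A) \sim N(\overline{\beta}, \overline{H}^{-1}), \tag{6.6}\]

where \(\overline{H} = \underline{H} + hX'X\) and \(\overline{\beta} = \overline{H}^{-1}(\underline{H}\,\underline{\beta} + hX'\tilde{y})\), and 

\[\overline{s}^2 h \mid (\beta, \tilde{y}, y^o, X, A) \sim \chi^2(\overline{\nu}), \tag{6.7}\]

where \(\overline{s}^2 = \underline{s}^2 + (\tilde{y} - X\beta)'(\tilde{y} - X\beta)\) and \(\overline{\nu} = \underline{\nu} + T\). These results reflect the fact, 
also apparent in (6.1) and (6.3), that with \(\tilde{y}\) in the conditioning set, thus treating it as 
observable, this model is precisely the same as the normal linear model introduced 
in Example 2.1.2. The distributions in (6.6) and (6.7) are the same as those in that 
example, but with \(\tilde{y}\) replacing \(y^o\). 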

The kernel of \(p(\tilde{y} \mid \beta, h, C^o, X, A)\) is determined by (6.5b)–(6.5c). Utilizing the 
equivalent expression 

\[\prod_{t=1}^T \exp[-h(\tilde{y}_t - \beta'x_t)^2/2]\]

in lieu of (6.5b), it is apparent that \(\tilde{y}_1, \ldots, \tilde{y}_T\) are conditionally independent, with 

\[p(\tilde{y}_t \mid \beta, h, C_t^o, x_t, A) \propto \exp[-h(\tilde{y}_t - \beta'x_t)^2/2]\, I_{C_t^o}(\tilde{y}_t).\]

Thus 

\[\tilde{y}_t \mid (\beta, h, C_t^o, x_t, A) \sim N(\beta'x_t, h^{-1}) \ \text{subject to}\ \tilde{y}_t \in C_t^o. \tag{6.8}\]

In the first motivating example, conditional on \(\beta\) and \(h\), \(\tilde{y}_t = y_t^o\) if \(y_t < y_t^*\) and 
\(\tilde{y}_t \sim N(\beta'x_t, h^{-1})\) subject to \(\tilde{y}_t \ge y_t^*\) if \(y_t = y_t^*\). In the second example, conditional 
on \(\beta\) and \(h\), \(\tilde{y}_t = y_t\) if \(y_t > 0\) and \(\tilde{y}_t \sim N(\beta'x_t, h^{-1})\) subject to \(\tilde{y}_t \le 0\) if \(y_t = 0\). 

This algorithm is the same as that developed in Example 4.3.1, but with the 
additional step (6.8). It was first proposed by Chib (1992). The approach can 
be applied to many kinds of censoring in more complex models as well. See 
Section 6.4.3, and Cowles et al. (1996) for examples. 
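The data augmentation step (6.8) can be illustrated with a minimal Python sketch for the right-censored case. It is not the BACC implementation: the function names, the argument `fitted` (standing for the values \(\beta'x_t\) at the current draw of \(\beta\)), and the inverse-CDF method for the truncated normal are all assumptions made for illustration. The inverse-CDF method is simple but loses accuracy when the truncation region lies far in the tail.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def draw_truncated_normal(mean, precision, lower, upper, rng):
    """Draw from N(mean, 1/precision) restricted to [lower, upper] by inverse CDF."""
    sd = 1.0 / math.sqrt(precision)
    p_lo = 0.0 if lower == -math.inf else STD_NORMAL.cdf((lower - mean) / sd)
    p_hi = 1.0 if upper == math.inf else STD_NORMAL.cdf((upper - mean) / sd)
    u = p_lo + (p_hi - p_lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf requires 0 < u < 1
    draw = mean + sd * STD_NORMAL.inv_cdf(u)
    return min(max(draw, lower), upper)  # guard against rounding at the bounds


def augment_right_censored(y_obs, limits, fitted, h, rng):
    """One pass of step (6.8) under right censoring.

    y_obs[t] is the recorded outcome, limits[t] the known censoring point,
    fitted[t] stands for beta'x_t, and h is the precision.
    """
    latent = []
    for y, limit, mu in zip(y_obs, limits, fitted):
        if y < limit:
            latent.append(y)  # uncensored: the set C_t is the singleton {y_t}
        else:
            # censored: draw the latent outcome from N(mu, 1/h) on [limit, inf)
            latent.append(draw_truncated_normal(mu, h, limit, math.inf, rng))
    return latent
```

In a full Gibbs sampler this pass would alternate with draws of \(\beta\) from (6.6) and \(h\) from (6.7), each computed with the current latent vector in place of the observed outcomes.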


Exercise 6.1.1 Categorical Censoring Household income in marketing surveys 
is often reported in brackets: for example, under $15,000, $15,000–$25,000, ..., 
over $95,000. 

(a) Suppose that household income is the outcome variable in a normal linear 
model. Using clear and consistent notation, show that this model, in 
conjunction with the reporting of income in brackets, is a special case of the 
censored linear model. 

(b) Suppose that the logarithm of household income is the covariate \(x_{tk}\) (the 
last column of X) in a normal linear model. Suppose further that the normal 
linear model is a correct specification; in the notation of Section 3.4, 
\(p(y \mid X, D) = p(y \mid \beta, h, X, A)\) for appropriate values \(\beta = \beta^*\) and \(h = h^*\). 
Finally, suppose that household income is incorrectly assumed to be at the 
midpoint of the observed bracket, and $140,000 is assumed to be household 
income if the observed bracket is ($95,000, \(\infty\)). Show that the pseudotrue 
values of \(\beta\) and \(h\) in this situation are not \(\beta^*\) and \(h^*\), respectively. 

(c) Let \(X^*\) denote columns 1 through k − 1 of X, and \(X^{*\prime} = [x_1^*, \ldots, x_T^*]\). 
Formulate a complete model incorporating the assumption that household 
incomes \(x_{tk}\) are conditionally independent given the \(x_t^*\), with \(\log(x_{tk}) \sim 
N(\gamma'x_t^*, j^{-1})\). Utilize a prior distribution for \(\gamma\) and \(j\) that is independent of 
the prior distribution for \(\beta\) and \(h\). 

(d) Outline a Gibbs sampling algorithm to simulate the posterior distribution of 
the model in (c). 

(e) Given the assumptions made about \(p(y \mid X, D)\) in (b) and about \(x_{tk}\) in (c), 
are the pseudotrue values of \(\beta\) and \(h\) in (c) equal to \(\beta^*\) and \(h^*\), respectively? 

(f) Suppose that in part (b) the level of household income rather than the 
logarithm of household income were the covariate \(x_{tk}\). How would this affect 
the Gibbs sampling algorithm in part (d)? 


Exercise 6.1.2 Inference for Censored Outcomes For the normal linear 
regression model 

\[y \mid (\beta, h, X, A) \sim N(X\beta, h^{-1} I_T), \quad \beta \mid A \sim N(\underline{\beta}, \underline{H}^{-1}), \quad \underline{s}^2 h \sim \chi^2(\underline{\nu}),\]

three independent samples have been collected. 

• Sample 1 has \(T = T_1\) observations, with observed covariate matrix \(X = X_1\) 
and the observed outcome vector \(y_1^o\). 

• Sample 2 has \(T = T_2\) observations. The covariate matrix \(X = X_2\) is observed, 
but the outcome vector \(y_2\) is not. However, for any two observations t and s, 
we know whether \(y_t > y_s\) or \(y_s > y_t\). (In other words, we observe the rank 
ordering of the outcome variables.) 

• Sample 3 has \(T = T_3\) observations. The covariate matrix \(X = X_3\) is observed. 
The outcome vector \(y_3^o\) was also observed, but through an error in data 
processing the correspondence to the covariate matrix is no longer known. (In 
other words, we observe the \(T_3\) outcomes but don’t know which observations 
they correspond to.) 

(a) Using samples 1 and 2, construct a simulator that draws from the posterior 
distribution for \(\beta\), \(h\), and \(y_2\). Be as explicit as you possibly can about the 
distributions used in the simulator. 

(b) Could you use sample 2 by itself to construct a simulator that would draw 
from the posterior distribution for \(\beta\), \(h\), and \(y_2\)? If so, state the 
algorithm for the simulator. If not, indicate the difficulty and what additional 
information would be required. 

(c) Using samples 1 and 3, construct a simulator that draws from the posterior 
distribution for \(\beta\), \(h\), and \(y_3\). Be as explicit as you possibly can about the 
distributions used in the simulator. 

(d) Show that the simulator constructed in part (c) is ergodic. 


6.2 PROBIT LINEAR MODELS 


Suppose that T individuals, indexed by t, each allocate income \(y_t\) between two 
goods, x and z, with respective prices \(p_x\) and \(p_z\) constant across individuals. Good 
x is continuously divisible, but z can be only 0 or 1. (For example, z might represent 
a decision to enlist in the military, or not.) Individual t’s utility function is 

\[U_t(x, z) = z(a_t - b/x) + (1 - z)(r_t - s/x); \quad b > 0,\ s > 0;\]

and he/she consumes out of income \(y_t\). If z = 0, then \(x = y_t/p_x\) and \(U_t(x, z) = 
r_t - s p_x/y_t\). If z = 1, then \(x = (y_t - p_z)/p_x\) and \(U_t(x, z) = a_t - b p_x/(y_t - p_z)\). 
Individual t chooses \(z_t = 1\) if \(-b p_x/(y_t - p_z) + s p_x/y_t + a_t - r_t > 0\). 

The econometrician observes \(p_x\), \(p_z\), and \((y_1, z_1), \ldots, (y_T, z_T)\). She does not 
observe \((a_1, r_1), \ldots, (a_T, r_T)\), but suppose that she is willing to take 

\[(a_t - r_t) \mid (p_x, p_z, y_1, \ldots, y_T) \overset{iid}{\sim} N(\mu, \sigma^2)\]

as representative of the distribution of these unobservables across individuals, and 
to regard income \(y_t\) as ancillary. Define the unobservable 

\[\tilde{z}_t = \mu - b p_x/(y_t - p_z) + s p_x/y_t + \varepsilon_t; \quad \varepsilon_t \overset{iid}{\sim} N(0, \sigma^2) \tag{6.9}\]

and note that \(z_t = I_{(0,\infty)}(\tilde{z}_t)\). From (6.9), we have 

\[P(z_t = 1 \mid b, s, \mu, \sigma^2, p_x, p_z, y_t, A) = \Phi\{\sigma^{-1}[-b p_x/(y_t - p_z) + s p_x/y_t + \mu]\}. \tag{6.10}\]

Because nothing is changed if b, s, \(\mu\), and \(\sigma\) are scaled by a common positive 
factor, the normalization \(\sigma^2 = 1\) is convenient. 
In a probit model, A, the observables are a T × k matrix of covariates \(X = [x_1, 
\ldots, x_T]'\) and a corresponding set of T binary outcomes with 

\[P(\text{first binary outcome} \mid x_t, A) = \Phi(\beta'x_t), \tag{6.11}\]
\[P(\text{second binary outcome} \mid x_t, A) = 1 - \Phi(\beta'x_t). \tag{6.12}\]

In the motivating example the three covariates are a constant term, \(-p_x/(y_t - p_z)\), 
and \(p_x/y_t\); the first binary outcome is \(z_t = 1\) and the second binary outcome is 
\(z_t = 0\). 

The substance of the probit model is (6.11)–(6.12), regardless of how the binary 
outcome is coded. If we introduce the latent variables \(\tilde{y}_t = \beta'x_t + \varepsilon_t\), \(\varepsilon_t \sim N(0, 1)\), 
then the first outcome corresponds to \(\tilde{y}_t > 0\) and the second to \(\tilde{y}_t \le 0\), because 
\(P(\tilde{y}_t > 0 \mid x_t) = \Phi(\beta'x_t)\). In this 
context the natural outcome coding is the set-valued function \(C_t = c_t(\tilde{y}_t)\), with 
\(C_t = (0, \infty)\) for the first outcome and \(C_t = (-\infty, 0]\) for the second. The 
conditionally conjugate prior distribution is \(\beta \sim N(\underline{\beta}, \underline{H}^{-1})\). 

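The joint distribution of observables and unobservables in the probit model is 

\[p(\beta, \tilde{y}, C \mid X, A) = p(\beta \mid A)\, p(\tilde{y} \mid \beta, X, A)\, p(C \mid \tilde{y}, A)\]
\[\propto \exp[-(\beta - \underline{\beta})'\underline{H}(\beta - \underline{\beta})/2] \exp[-(\tilde{y} - X\beta)'(\tilde{y} - X\beta)/2] \prod_{t=1}^T I_{C_t}(\tilde{y}_t).\]

This expression is identical to (6.5a)–(6.5c) for the censored linear model, except 
that here h = 1. Proceeding as in the analysis of that model, we obtain 

\[\beta \mid (\tilde{y}, C, A) \sim N(\overline{\beta}, \overline{H}^{-1}),\]

where \(\overline{H} = \underline{H} + X'X\) and \(\overline{\beta} = \overline{H}^{-1}(\underline{H}\,\underline{\beta} + X'\tilde{y})\). In the distribution of \(\tilde{y}\) conditional 
on \((C, \beta, X, A)\) the elements \(\tilde{y}_t\), known as probits, are independent: 

\[p(\tilde{y}_t \mid \beta, C_t, X, A) \propto \exp[-(\tilde{y}_t - \beta'x_t)^2/2]\, I_{C_t}(\tilde{y}_t).\]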


These conditional posterior distributions are the basis of a very simple Gibbs 
sampling algorithm, first proposed in Albert and Chib (1993b). BACC incorporates 
the probit linear model using the conjugate prior distribution and posterior simula- 
tion algorithm described in this section. This approach can be extended to situations 
involving more complex choices. When one choice is made from several alterna- 
tives, the logical extension is the multinomial probit model, for which Bayesian 
Markov chain Monte Carlo methods have been developed by McCulloch and Rossi 
(1994), Geweke et al. (1994, 1997), and McCulloch et al. (2000). When there are 
several related dichotomous choices, the natural extension is the multivariate probit 
model; see Chib and Greenberg (1998). 
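The two conditional distributions above translate directly into a Gibbs sampler. The following Python sketch restricts attention to a single covariate (k = 1), so that the posterior precision and mean of \(\beta\) are scalars and no matrix algebra is needed. The function names, the coding `z = 1` for \(\tilde{y}_t > 0\), and the inverse-CDF truncated normal draw are assumptions made for illustration; this is not the BACC code.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_probit_latent(mu, positive, rng):
    """Draw the latent variable from N(mu, 1) on (0, inf) if positive, else on (-inf, 0]."""
    p0 = STD_NORMAL.cdf(-mu)  # probability that the latent variable is <= 0
    u = rng.random()
    u = p0 + (1.0 - p0) * u if positive else p0 * u
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf requires 0 < u < 1
    draw = mu + STD_NORMAL.inv_cdf(u)
    # guard against rounding across the boundary at zero
    return max(draw, 0.0) if positive else min(draw, 0.0)


def probit_gibbs_scalar(x, z, prior_mean, prior_prec, n_iter, rng):
    """Gibbs sampler for a probit model with one covariate and a normal prior."""
    beta = prior_mean
    post_prec = prior_prec + sum(xi * xi for xi in x)  # scalar analogue of H + X'X
    draws = []
    for _ in range(n_iter):
        # block 1: latent variables, each truncated to the side implied by z_t
        latent = [draw_probit_latent(beta * xi, zi == 1, rng) for xi, zi in zip(x, z)]
        # block 2: beta given the latent variables (normal linear model with h = 1)
        post_mean = (prior_prec * prior_mean
                     + sum(xi * yi for xi, yi in zip(x, latent))) / post_prec
        beta = rng.gauss(post_mean, 1.0 / math.sqrt(post_prec))
        draws.append(beta)
    return draws
```

With k > 1 the second block becomes a multivariate normal draw with precision \(\underline{H} + X'X\); the first block is unchanged.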


Exercise 6.2.1 Normalization and Representation in the Probit Model Con- 
sider the motivating example at the start of this section. 


(a) Show that if the same constant is added to \(a_t\) and \(r_t\), an individual’s 
consumption will not change. What does this say about the econometrician’s 
decision to assume only a distribution for \(a_t - r_t\), rather than assume a 
nondegenerate bivariate distribution for \((a_t, r_t)\)? 

(b) Show that if the parameters b, s, \(\mu\), and \(\sigma\) in (6.9) or (6.10) are scaled by 
a common positive constant, then there is no change in the distribution of 
the observables \((y_1, z_1), \ldots, (y_T, z_T)\). 

(c) Why is it preferable to resolve the indeterminacy in scaling by taking \(\sigma^2 = 1\) 
rather than b = 1 or s = 1? 


(d) Derive (6.10) from (6.9). 


Exercise 6.2.2 Ordered Probit Model In this model, \(\tilde{y} \mid (\beta, h, X, A) \sim N(X\beta, 
h^{-1} I_T)\). However, the variables \(\tilde{y}_t\) are unobservable. Instead, we observe the 
outcome \(y_t = -1\) if \(\tilde{y}_t < 0\), \(y_t = 0\) if \(\tilde{y}_t \in [0, 1]\), and \(y_t = 1\) if \(\tilde{y}_t > 1\). (In actual 
application this could be coding for the outcomes “negative,” “indeterminate,” and 
“positive” in a medical test.) 

(a) Are the “cutoff” values of 0 and 1 for \(\tilde{y}_t\) arbitrary? For example, would 
it have been less restrictive to choose \(y_t = -1\) if \(\tilde{y}_t < c_1\), \(y_t = 0\) if \(\tilde{y}_t \in 
[c_1, c_2]\), and \(y_t = 1\) if \(\tilde{y}_t > c_2\), treating \(c_1\) and \(c_2\) as unobservable parameters? 

(b) Derive a Gibbs sampling algorithm whose blocks are \(\beta\), \(h\), and the 
unobservable \(\tilde{y}_t\). Show that it is ergodic with the unique invariant distribution 
being the posterior distribution of \(\beta\), \(h\), and the unobservable \(\tilde{y}_t\). 


(c) Could this idea be extended to n > 3 ordered outcomes? 


6.3 THE INDEPENDENT FINITE STATE MODEL 


Often economic agents or entities can be characterized as being in one of a small 
number of possible states. For example, a sample of individuals from a population 
at a specific point in time might be classified as registered to vote with one of a 
few political party preferences, registered without preference, or unregistered; or 
households might be classified by the number of individuals in the household. If the 
probability that a particular individual (household) is in a particular state is inde- 
pendent of other observables and of the states of all other individuals (households), 
then the distribution of classifications is an independent finite state model. 

This model is extremely simple. While we are sometimes interested in it directly, 
it is interesting primarily because it can arise as a constituent of a more compli- 
cated model. The independent finite state model will appear again in models with 
mixtures (Section 6.4) and the first-order Markov finite state model (Section 7.2). 

In the independent finite state model there are m possible states of the world 
that are occupied by n agents over T time periods; m > 1, n > 0, T > 0. For 
any agent k, let the integer \(s_{kt}\) indicate the state occupied at time t; \(1 \le s_{kt} \le m\). 
The independent finite state model specifies that the nT random variables \(s_{kt}\) are 
mutually independent, and 

\[P(s_{kt} = j \mid \pi_1, \ldots, \pi_m, A) = \pi_j \quad (k = 1, \ldots, n;\ t = 1, \ldots, T;\ j = 1, \ldots, m);\]

of course, \(\sum_{j=1}^m \pi_j = 1\). The observables can be collected in the n × T matrix 
\(S = [s_{kt}]\) and the unobservables arranged in the m × 1 vector \(\pi = (\pi_1, \ldots, \pi_m)'\). 
Turning to inference, define 

\[n_j = \sum_{k=1}^n \sum_{t=1}^T \delta(s_{kt}, j), \tag{6.13}\]

the number of times that \(s_{kt} = j\) occurs in the sample. [The expression \(\delta(s_{kt}, j)\) in 
(6.13) is an instance of the Kronecker delta function, defined \(\delta(a, b) = 1\) if a = b 
and \(\delta(a, b) = 0\) if \(a \ne b\).] From (6.13), we obtain 

\[p(S \mid \pi, A) = \prod_{k=1}^n \prod_{t=1}^T \pi_{s_{kt}} = \prod_{j=1}^m \pi_j^{n_j}. \tag{6.14}\]


The number of occurrences \(n_j\) (j = 1, ..., m) constitutes a vector of sufficient 
statistics in this model. The likelihood function provides the kernel of the conjugate 
prior distribution of \(\pi\). It is that of the Dirichlet distribution (Kotz et al. 2000, 
Section 49.1) for random variables \(x_1, \ldots, x_m\) jointly distributed on the 
(m − 1)-dimensional unit simplex 

\[\Big\{x_i \ge 0\ (i = 1, \ldots, m);\ \sum_{i=1}^m x_i = 1\Big\},\]

with m positive parameters \(a_1, \ldots, a_m\) and density 

\[p(x_1, \ldots, x_m \mid a_1, \ldots, a_m) = \Gamma\Big(\sum_{i=1}^m a_i\Big) \prod_{i=1}^m x_i^{a_i - 1} \Big/ \prod_{i=1}^m \Gamma(a_i). \tag{6.15}\]

[Note that the beta(\(a_1, a_2\)) distribution (Casella and Berger 2002, Section 3.3) is 
the special case m = 2.] Thus the conjugate prior distribution of \(\pi\) is the Dirichlet 
distribution with density 

\[p(\pi \mid A) = \Gamma\Big(\sum_{j=1}^m a_j\Big) \prod_{j=1}^m \pi_j^{a_j - 1} \Big/ \prod_{j=1}^m \Gamma(a_j). \tag{6.16}\]

Like all conjugate prior distributions, this one has a notional data interpretation; 
the information in the prior is analogous to that in \(\sum_{j=1}^m a_j - m\) observations in 
the context of the independent finite state model, with \(a_j - 1\) occurrences of state 
j (j = 1, ..., m). 

The product of (6.14) and (6.16) is the posterior density kernel in standard form: 

\[p(\pi \mid S^o, A) \propto \Gamma\Big(\sum_{j=1}^m a_j\Big) \prod_{j=1}^m \pi_j^{a_j + n_j - 1} \Big/ \prod_{j=1}^m \Gamma(a_j). \tag{6.17}\]

Comparing (6.17) with (6.15), it follows that the posterior distribution of \(\pi\) is 
Dirichlet with parameters \(a_j + n_j\) (j = 1, ..., m). The posterior density combines 
prior and sample information in a direct and obvious way. This is a consequence of 
the fact that the Dirichlet distribution is a member of the exponential family 
(Definition 2.3.3) and the general result (2.60) for posterior distributions with conjugate 
priors in that family. 

The corresponding marginal likelihood is the integral of the right side of (6.17) 
with respect to \(\pi\). The value of the integral can be read from the normalizing 
constant in (6.15), and thus 

\[p(S^o \mid A) = \frac{\Gamma\big(\sum_{j=1}^m a_j\big) \prod_{j=1}^m \Gamma(a_j + n_j)}{\prod_{j=1}^m \Gamma(a_j)\ \Gamma\big[\sum_{j=1}^m (a_j + n_j)\big]}. \tag{6.18}\]
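Because the marginal likelihood is a ratio of Dirichlet normalizing constants, it can be evaluated on the log scale with the log-gamma function. The sketch below assumes the form obtained by integrating the product of (6.14) and (6.16), namely \(\Gamma(\sum_j a_j)/\prod_j \Gamma(a_j)\) times \(\prod_j \Gamma(a_j + n_j)/\Gamma[\sum_j (a_j + n_j)]\). The function names are hypothetical, and states are assumed to be coded 1, ..., m.

```python
import math


def state_counts(states, m):
    """Count n_j, the number of entries of `states` equal to j (j = 1, ..., m)."""
    counts = [0] * m
    for s in states:
        counts[s - 1] += 1
    return counts


def log_marginal_likelihood(prior_a, counts):
    """Log marginal likelihood of the observed states under a Dirichlet(prior_a) prior."""
    post_a = [a + n for a, n in zip(prior_a, counts)]  # posterior Dirichlet parameters
    return (math.lgamma(sum(prior_a)) - sum(math.lgamma(a) for a in prior_a)
            + sum(math.lgamma(a) for a in post_a) - math.lgamma(sum(post_a)))
```

As a check, with m = 2, a uniform prior (a_1 = a_2 = 1), and a single observation in state 1, the function returns log(1/2), the prior probability of that observation.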


As a consequence of the simplicity of the independent finite state model, the 
posterior density and marginal likelihood have closed-form analytical expressions. 
In a direct application of the model, there may be no need for posterior simulation; 
see Exercise 6.3.1. When the independent finite state model is a constituent of a 
more complex model, however, it is useful to be able to draw from the Dirichlet 
distribution whose general form is (6.15). This can be done using a technique 
given in Devroye (1986), pp. 593–596: construct the independent random variables 
\(d_i \sim \chi^2(2a_i)\) (i = 1, ..., m) and then take \(x_j = d_j / \sum_{i=1}^m d_i\) (j = 1, ..., m). 
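The chi-square construction can be sketched in a few lines of Python. A chi-square variate with 2a degrees of freedom is a gamma variate with shape a and scale 2, so the standard library's `random.gammavariate` suffices. The function names are hypothetical.

```python
import random


def draw_dirichlet(a, rng):
    """Draw from Dirichlet(a_1, ..., a_m) by normalizing independent chi-square(2 a_i) draws."""
    d = [rng.gammavariate(a_i, 2.0) for a_i in a]  # chi-square(2 a_i) = gamma(shape a_i, scale 2)
    total = sum(d)
    return [d_i / total for d_i in d]


def draw_posterior_pi(prior_a, counts, rng):
    """Draw the state probabilities from the Dirichlet posterior with parameters a_j + n_j."""
    return draw_dirichlet([a + n for a, n in zip(prior_a, counts)], rng)
```

The common scale factor cancels in the normalization, so any gamma scale would give the same distribution; scale 2 is used only to match the chi-square description.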

BACC incorporates the independent finite state model using the conjugate prior 
distribution and posterior simulation algorithm described in this section. 


Exercise 6.3.1 Properties of the Posterior Distribution Suppose that \(S = S^o\) is 
observed in the context of the independent finite state model A with observables 
probability distribution (6.14) and prior pdf (6.16). 

(a) Derive an analytical expression for the moment \(E\big(\prod_{j=1}^m \pi_j^{r_j} \mid S^o, A\big)\). 

(b) From the result in (a) express the posterior mean and variance of \(\pi_j\). 

(c) What is the value of the posterior mode? 

(d) Show that the marginal distribution of \(x_i\) is beta\((a_i, \sum_{j \ne i} a_j)\). [Hint: 
Consider the construction of Devroye (1986) and the reproductive property of 
the chi square distribution.] 

(e) Determine the marginal posterior density of any pair of parameters \((\pi_j, \pi_k)\) 
\((j \ne k)\). 


Exercise 6.3.2 Empty Cells in the Independent Finite State Model Suppose 
that in a sample \(S^o\), \(n_1 = 0\). 

(a) What is the maximum likelihood estimate of \(\pi_1\)? Can you state the 
asymptotic standard error associated with this estimate? 

(b) What is the marginal posterior distribution of \(\pi_1\), and what are the mean 
and standard deviation of this distribution? 


6.4 MODELING WITH MIXTURES OF NORMAL DISTRIBUTIONS 


The models for continuously distributed observables treated up to this point have 
repeatedly exploited the specifications 


\[\beta \sim N(\underline{\beta}, \underline{H}_\beta^{-1}), \tag{6.19}\]
\[\varepsilon \sim N(0, h^{-1} I_T), \tag{6.20}\]

in the relation 

\[y = X\beta + \varepsilon \tag{6.21}\]

between the observables X and y. As first noted in Section 1.1.2, the specification 
(6.20) is quite poor in some applications. It has nonetheless been maintained, 
to this point, because (1) the combination (6.19)–(6.20) leads to a normal conditional 
posterior distribution for \(\beta\) in all these models, and (2) we can generalize 
both (6.19) and (6.20), building on the tools developed in the process of treating 
(6.19)–(6.20) in the context of (6.21). This section turns to the generalization of 
(6.20), while that of (6.19) is taken up in Section 8.4. The keys to extending (6.20) 
are using latent variables and successive conditioning to create mixtures of normal 
distributions. In the process of posterior simulation the successive conditioning in 
the Gibbs sampler and related MCMC procedures recovers the underlying normal 
distributions. 


6.4.1 The Independent Student-t Linear Model 


In many applications of the normal linear model there is substantial evidence that 
the probability of an unusually large or small value of the outcome y, is substantially 
greater than indicated by a normal distribution. This is a well-documented phe- 
nomenon in the case of financial asset returns; Section 1.1.2 provides an example. 
An alternative to (6.20)—(6.21) that better accommodates this phenomenon is 


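\[y_t = \beta'x_t + \varepsilon_t, \tag{6.22}\]
\[\varepsilon_t \mid (h, \lambda, A) \overset{iid}{\sim} t(0, h^{-1}; \lambda) \quad (t = 1, \ldots, T). \tag{6.23}\]

That is, each disturbance \(\varepsilon_t\) has a Student-t distribution with location parameter 0, 
scale parameter \(h^{-1}\), and \(\lambda\) degrees of freedom. A standard representation for \(\varepsilon_t\) is 

\[\varepsilon_t = \tilde{h}_t^{-1/2} \eta_t, \tag{6.24}\]

with 

\[\eta_t \mid (h, A) \overset{iid}{\sim} N(0, h^{-1}) \quad (t = 1, \ldots, T), \tag{6.25}\]
\[\lambda \tilde{h}_t \mid (\lambda, A) \overset{iid}{\sim} \chi^2(\lambda) \quad (t = 1, \ldots, T). \tag{6.26}\]

The 2T random variables \(\eta = (\eta_1, \ldots, \eta_T)'\) and \(\tilde{h} = (\tilde{h}_1, \ldots, \tilde{h}_T)'\) are mutually 
independent. A convenient prior distribution for \(\lambda\), which may easily be generalized 
using the methods of Section 8.4, is the exponential with mean \(\underline{\lambda}\): 

\[\lambda \sim \exp(\underline{\lambda}), \tag{6.27}\]
\[p(\lambda) = \underline{\lambda}^{-1} \exp(-\lambda/\underline{\lambda}). \tag{6.28}\]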
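Smaller values for \(\underline{\lambda}\) reflect assumptions that the distribution is more leptokurtic. 
From (6.22), (6.24), and (6.25) 

\[p(y \mid X, \beta, h, \tilde{h}, A) \propto h^{T/2} \Big(\prod_{t=1}^T \tilde{h}_t^{1/2}\Big) \exp\Big[-h \sum_{t=1}^T \tilde{h}_t (y_t - \beta'x_t)^2/2\Big], \tag{6.29}\]

and from (6.26), we have 

\[p(\tilde{h}_t \mid \lambda, A) = [2^{\lambda/2}\Gamma(\lambda/2)]^{-1} \lambda^{\lambda/2} \tilde{h}_t^{(\lambda-2)/2} \exp(-\lambda \tilde{h}_t/2) \quad (t = 1, \ldots, T). \tag{6.30}\]

Completing the model with \(\beta \mid A \sim N(\underline{\beta}, \underline{H}_\beta^{-1})\) and \(\underline{s}^2 h \mid A \sim \chi^2(\underline{\nu})\), we obtain 

\[p(\beta \mid A) \propto \exp[-(\beta - \underline{\beta})'\underline{H}_\beta(\beta - \underline{\beta})/2], \tag{6.31}\]
\[p(h \mid A) \propto h^{(\underline{\nu}-2)/2} \exp(-\underline{s}^2 h/2). \tag{6.32}\]

In view of (6.29), it proves convenient to define the “reweighted” observables: 

\[y_t^* = \tilde{h}_t^{1/2} y_t, \quad x_t^* = x_t \tilde{h}_t^{1/2} \quad (t = 1, \ldots, T),\]
\[y^* = (y_1^*, \ldots, y_T^*)', \quad X^* = [x_1^*, \ldots, x_T^*]'.\]

Examining (6.31) and (6.29), it is apparent that the conditional distribution 
\(p(\beta \mid h, \tilde{h}, X, y^o, A)\) has exactly the same form as in the normal linear model 
of Example 2.1.2, but with \(X^*\) and \(y^{*o}\) in place of X and \(y^o\): 

\[\beta \mid (\lambda, h, \tilde{h}, y^o, X, A) \sim N(\overline{\beta}, \overline{H}_\beta^{-1}),\]
\[\overline{H}_\beta = \underline{H}_\beta + h X^{*\prime}X^*, \quad \overline{\beta} = \overline{H}_\beta^{-1}(\underline{H}_\beta \underline{\beta} + h X^{*\prime}y^{*o}).\]

Similarly, from (6.32) and (6.29), we obtain 

\[\overline{s}^2 h \mid (\lambda, \beta, \tilde{h}, y^o, X, A) \sim \chi^2(\overline{\nu}),\]
\[\overline{s}^2 = \underline{s}^2 + (y^{*o} - X^*\beta)'(y^{*o} - X^*\beta), \quad \overline{\nu} = \underline{\nu} + T.\]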


Note that the degrees of freedom parameter \(\lambda\) is not involved in either 
distribution. This is a consequence of the equivalent interpretation of \(\tilde{h}\) as a vector of 
parameters, and (6.26)–(6.27) as a hierarchical prior distribution for \(\tilde{h}\). 

The product of (6.29) and (6.30) provides the kernel of \(p(\tilde{h} \mid \lambda, \beta, h, X, y^o, A)\), 
in which \(\tilde{h}_1, \ldots, \tilde{h}_T\) are conditionally independent, specifically 

\[p(\tilde{h}_t \mid \lambda, \beta, h, y^o, X, A) \propto \tilde{h}_t^{(\lambda-1)/2} \exp\{-[\lambda + h(y_t - \beta'x_t)^2]\tilde{h}_t/2\},\]

implying 

\[[\lambda + h(y_t - \beta'x_t)^2]\, \tilde{h}_t \mid (\lambda, \beta, h, y^o, X, A) \sim \chi^2(\lambda + 1).\]
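The conditional distribution of the latent precisions is a scaled chi-square with \(\lambda + 1\) degrees of freedom, which is simple to simulate. In the Python sketch below the function names are hypothetical, and `residuals` stands for the values \(y_t - \beta'x_t\) at the current draw of \(\beta\).

```python
import random


def draw_chi_square(dof, rng):
    """Chi-square draw via the gamma distribution: shape dof / 2, scale 2."""
    return rng.gammavariate(dof / 2.0, 2.0)


def draw_latent_precisions(residuals, h, lam, rng):
    """Draw each latent precision: [lam + h * e_t^2] times it is chi-square(lam + 1)."""
    return [draw_chi_square(lam + 1.0, rng) / (lam + h * e * e) for e in residuals]
```

The conditional mean of each latent precision is \((\lambda + 1)/[\lambda + h(y_t - \beta'x_t)^2]\), so observations with large residuals receive small weights in the reweighted regression. This is how the sampler accommodates outliers.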


The conditional posterior density kernel of \(\lambda\) is proportional to the product of 
(6.28) and (6.30): 

\[p(\lambda \mid \beta, h, \tilde{h}, y^o, A) \propto [2^{\lambda/2}\Gamma(\lambda/2)]^{-T} \lambda^{T\lambda/2} \Big(\prod_{t=1}^T \tilde{h}_t\Big)^{(\lambda-2)/2} \tag{6.33a}\]
\[\cdot\ \exp\Big[-\Big(\underline{\lambda}^{-1} + \tfrac{1}{2}\sum_{t=1}^T \tilde{h}_t\Big)\lambda\Big] = k(\lambda). \tag{6.33b}\]

Clearly the kernel \(k(\lambda)\) does not correspond to any common distribution. The 
second component (6.33b) is the kernel of an exponential distribution that could 
be used as a candidate distribution in a Metropolis within Gibbs step. However 
the first component (6.33a) is also relatively quite informative for \(\lambda\), leading to 
acceptance probabilities that in general are very low. Instead, take the candidate 
density \(g(\lambda)\) to be that of a univariate normal distribution, with mean at the mode 
\(\hat{\lambda}\) of \(k(\lambda)\) and precision equal to \(-d^2 \log k(\lambda)/d\lambda^2 \,|_{\lambda = \hat{\lambda}}\). This method is used in 
BACC, for models with Student-t distributions. Typically the degrees of freedom 
parameter, \(\lambda\), exhibits more serial correlation in the Gibbs sampling algorithm than 
do other parameters in the model, but the separated partial means test indicates 
satisfactory performance in simulations that can be computed in a few minutes; for 
more detail, see Exercise 6.4.4. 
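A Metropolis step with a normal candidate centered at the mode of the kernel might be sketched as follows. The text does not say how the mode or the curvature is computed, so the golden-section search, the finite-difference second derivative, the search bracket, and all function names are assumptions made for illustration. Here log k is taken to be T[(λ/2)log(λ/2) − log Γ(λ/2)] + (λ/2 − 1)Σ log h̃_t − [1/λ̲ + ½Σ h̃_t]λ, which is the log of the kernel in (6.33a)–(6.33b); `prior_mean` is the prior mean λ̲.

```python
import math
import random


def log_k(lam, h_tilde, prior_mean):
    """Log of the kernel k(lambda) for the degrees of freedom parameter."""
    T = len(h_tilde)
    half = lam / 2.0
    return (-T * (half * math.log(2.0) + math.lgamma(half))
            + T * half * math.log(lam)
            + (half - 1.0) * sum(math.log(v) for v in h_tilde)
            - (1.0 / prior_mean + 0.5 * sum(h_tilde)) * lam)


def find_mode(h_tilde, prior_mean, lo=1e-3, hi=500.0, tol=1e-6):
    """Golden-section search for the maximizer of log k; [lo, hi] is assumed to bracket it."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - ratio * (b - a), a + ratio * (b - a)
    fc, fd = log_k(c, h_tilde, prior_mean), log_k(d, h_tilde, prior_mean)
    while b - a > tol:
        if fc > fd:  # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - ratio * (b - a)
            fc = log_k(c, h_tilde, prior_mean)
        else:  # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + ratio * (b - a)
            fd = log_k(d, h_tilde, prior_mean)
    return 0.5 * (a + b)


def metropolis_lambda_step(lam, h_tilde, prior_mean, rng):
    """One independence Metropolis step with a normal candidate centered at the mode."""
    mode = find_mode(h_tilde, prior_mean)
    step = 1e-3 * mode
    curvature = (log_k(mode + step, h_tilde, prior_mean)
                 - 2.0 * log_k(mode, h_tilde, prior_mean)
                 + log_k(mode - step, h_tilde, prior_mean)) / step ** 2
    sd = 1.0 / math.sqrt(-curvature)  # candidate precision is minus the second derivative
    cand = rng.gauss(mode, sd)
    if cand <= 0.0:
        return lam  # the target density is zero for nonpositive values: reject

    def log_g(v):
        return -0.5 * ((v - mode) / sd) ** 2  # candidate log kernel

    log_ratio = ((log_k(cand, h_tilde, prior_mean) - log_g(cand))
                 - (log_k(lam, h_tilde, prior_mean) - log_g(lam)))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return cand
    return lam
```

The search relies on log k being concave in λ, so that the mode is unique; a bracket that is too narrow would return an endpoint rather than the mode.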


6.4.2 Normal Mixture Linear Models 


The normal mixture linear model begins with (6.22) and then introduces the latent 
state vector \(\tilde{s} = (\tilde{s}_1, \ldots, \tilde{s}_T)'\). Conditional on X, \(\tilde{s}\) obeys the independent finite 
state model of Section 6.3 with parameter vector \(\pi = (\pi_1, \ldots, \pi_m)'\): 

\[p(\tilde{s} \mid \pi, A) = \prod_{t=1}^T \pi_{\tilde{s}_t} = \prod_{j=1}^m \pi_j^{T_j}, \tag{6.34}\]

where \(T_j = \sum_{t=1}^T \delta(\tilde{s}_t, j)\) is the number of observations t for which \(\tilde{s}_t = j\). 

Corresponding to each of the m states j, there is a mean parameter \(\alpha_j\) and 
a positive precision parameter \(h_j\); let \(\alpha = (\alpha_1, \ldots, \alpha_m)'\) and \(\mathbf{h} = (h_1, \ldots, h_m)'\). 
Conditional on \(\tilde{s}_t = j\), \(\varepsilon_t \sim N[\alpha_j, (h \cdot h_j)^{-1}]\). Thus 

\[p[y_t \mid \beta, h, \pi, \alpha, \mathbf{h}, \tilde{s}_t = j, A] = (2\pi)^{-1/2} (h \cdot h_j)^{1/2} \exp[-h \cdot h_j (y_t - \alpha_j - \beta'x_t)^2/2] \quad (t = 1, \ldots, T). \tag{6.35}\]

The disturbances \(\varepsilon_t\) are i.i.d. and follow a discrete normal mixture distribution: 

\[p(\varepsilon_t \mid h, \pi, \alpha, \mathbf{h}, A) = (2\pi)^{-1/2} h^{1/2} \sum_{j=1}^m \pi_j h_j^{1/2} \exp[-h \cdot h_j (\varepsilon_t - \alpha_j)^2/2].\]

Clearly \(h\) and \(\mathbf{h}\) are unidentified, in the sense described in Exercise 4.5.3, as is 
\(\alpha\) if X contains a column of units, and states are not identified with respect to 
permutation of the state index. Identification issues will be taken up subsequently 
in the context of prior distributions. 

The mixture of normals distribution is very flexible. Figure 6.1 provides several 
examples. For the special case in which the means \(\alpha_j\) are all the same, the normal 
mixture distribution is known as the “scale mixture of normals distribution.” 
That distribution is symmetric, is unimodal, and must be leptokurtic; that is, the 
coefficient of kurtosis \(K = E[\varepsilon_t - E(\varepsilon_t)]^4/\mathrm{var}(\varepsilon_t)^2 > 3\), its value if \(\varepsilon_t\) is normally 
distributed (see Exercise 6.4.1). Panels (a) and (f) of Figure 6.1 provide examples 
of scale mixture of normals distributions. If the means \(\alpha_j\) are not all the same, 
then the normal mixture distribution can be skewed, as illustrated in panels (c) and 
(d). It can also be platykurtic (K < 3), as is the case in panels (b) and (e). Of 
course, these distributions can be multimodal [panel (e)]. With a sufficient number 
of components, the normal mixture distribution can mimic distributions that are 
quite different from the normal, like the uniform [panel (b)]. 

Figure 6.1. Several normal mixture probability density functions: (a) leptokurtic distribution, t(5) 
moments; (b) platykurtic distribution; (c) mildly right-skewed distribution; (d) severely right-skewed 
distribution; (e) bimodal distribution; (f) \(N(0, 1)\) plus 10% \(N(0, 5^2)\) outliers. In each panel the heavy 
line indicates a normal mixture probability density function. The lighter lines indicate the component 
normal density functions, each scaled by its probability. 
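The leptokurtosis of a scale mixture can be checked numerically. For a zero-mean scale mixture with weights \(\pi_j\) and component variances \(\sigma_j^2\), the second moment is \(\sum_j \pi_j \sigma_j^2\) and the fourth moment is \(3\sum_j \pi_j \sigma_j^4\), so \(K = 3\sum_j \pi_j \sigma_j^4 / (\sum_j \pi_j \sigma_j^2)^2 \ge 3\) by Jensen's inequality. The Python sketch below uses hypothetical function names; in `mixture_pdf` the argument `precisions` stands for the products \(h \cdot h_j\).

```python
import math


def mixture_pdf(e, weights, means, precisions):
    """Density of a discrete normal mixture evaluated at e."""
    return sum(w * math.sqrt(p / (2.0 * math.pi)) * math.exp(-0.5 * p * (e - a) ** 2)
               for w, a, p in zip(weights, means, precisions))


def scale_mixture_kurtosis(weights, variances):
    """Kurtosis of a zero-mean scale mixture of normals."""
    second = sum(w * v for w, v in zip(weights, variances))
    fourth = 3.0 * sum(w * v * v for w, v in zip(weights, variances))
    return fourth / second ** 2
```

For an illustrative two-component mixture with weights 0.9 and 0.1 and variances 1 and 25 (values chosen for illustration), the kurtosis is 190.2/11.56, about 16.5, far above the normal value of 3.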

The conditionally conjugate prior densities in the normal mixture linear model 
are (6.31) for B and (6.32) for h. The other parameters in the model all pertain to 
the normal mixture distribution of ¢,. The choice of the prior distribution of these 
parameters is driven by three considerations: 


1. Priors should be conditionally conjugate and proper. Conditionally conjugate 
priors simplify simulation from the posterior, as first noted in Section 2.3, 
and these priors can be revised by the reweighting methods discussed in 
Section 8.4. As discussed in Section 3.2, improper priors lead to difficulties 
in model comparison. In the normal mixture model, improper priors can 
lead to the even more serious complication that the posterior distribution 
itself is improper; see West and Harrison (1989), Section 12.3.4, Roeder and 
Wasserman (1997), and Geweke and Keane (2001). 

2. Since we have introduced no information to the effect that two states indexed 
by i and j should not be indexed by j and i instead, the states \(\tilde{s}_t\) are 
symmetric a priori. Prior distributions must incorporate this symmetry. A 
consequence is that the posterior distribution will also be symmetric, since 
interchanging the states does not affect the likelihood function. The pos- 
terior distribution will have m! symmetric components. In applications in 
which the different states have a substantive interpretation, this creates seri- 
ous conceptual complications (Celeux et al. 2000) and requires some explicit 
modifications of the posterior simulator like those suggested by Fruhwirth- 
Schnatter (2001). If, however, the only function of the latent states is to 
permit flexibility in the functional form of the probability density function, 
as is the case here, these issues are moot. 

3. It is easier to specify a prior distribution with a smaller number of hyperpa- 
rameters than a larger number. On the other hand, the range of hyperparam- 
eters must not be so narrow as to unduly compromise the flexibility of the 
normal mixture distribution. 


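These considerations lead to a Dirichlet distribution with parameters \(a_1 = \cdots = 
a_m = r\) for \(\pi\), 

\[p(\pi \mid A) = [\Gamma(mr)/\Gamma(r)^m] \prod_{j=1}^m \pi_j^{r-1}; \tag{6.36}\]

independent gamma distributions for the components of \(\mathbf{h}\), \(\underline{\nu}_* h_j \overset{iid}{\sim} \chi^2(\underline{\nu}_*)\) 
(j = 1, ..., m), 

\[p(\mathbf{h} \mid A) = [2^{\underline{\nu}_*/2}\Gamma(\underline{\nu}_*/2)]^{-m}\, \underline{\nu}_*^{m\underline{\nu}_*/2} \prod_{j=1}^m h_j^{(\underline{\nu}_*-2)/2} \exp(-\underline{\nu}_* h_j/2), \tag{6.37}\]

and \(\alpha \mid (h, A) \sim N[0, (\underline{h}_\alpha h)^{-1} I_m]\), so that 

\[p(\alpha \mid h, A) = (2\pi)^{-m/2} (\underline{h}_\alpha h)^{m/2} \exp(-\underline{h}_\alpha h\, \alpha'\alpha/2). \tag{6.38}\]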


The specification \(E(\alpha) = 0\) resolves the identification issues with respect to \(\alpha\) 
and \(\beta\) given that (as is usually the case) X has a column of units. The prior 
variance in \(\beta\) conveys uncertainty about the location of the distribution of y given 
X. The prior distribution of \(\alpha\) is scale dependent on \(h^{-1/2}\); that is, it states prior 
beliefs about the shape of the distribution. Keeping in mind that \(E(\mathbf{h}) = \iota_m\), a 
prior distribution with \(\underline{h}_\alpha^{-1/2} = 5\) implies a prior probability of multimodality that 
is near 1, whereas a much smaller value of \(\underline{h}_\alpha^{-1/2}\) makes this probability negligibly small. Keeping in 
mind that \(E(\alpha) = 0\), choice of \(\underline{\nu}_*\) governs the prior probability of tail thickness 
in the mixture normal density relative to the normal. In the prior distribution the 
ratio \(h_j/h_k \sim F(\underline{\nu}_*, \underline{\nu}_*)\) for all \(j \ne k\). If \(\underline{\nu}_* = 1\), the prior probability of component 
variance ratios at least as great as those shown in Figure 6.1f is significant, whereas 
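if \(\underline{\nu}_* = 5\), it is negligible. 

Posterior inference in the normal mixture model utilizes five blocks: \(\gamma' = 
(\alpha', \beta')\), \(h\), \(\pi\), \(\mathbf{h}\), and \(\tilde{s}\). It is useful to define 

\[\underset{T \times m}{Z(\tilde{s})} = Z = [z_{tj}], \quad z_{tj} = \delta(\tilde{s}_t, j); \qquad \underset{T \times (m+k)}{W} = [Z \;\; X],\]

\[\underset{(m+k) \times 1}{\underline{\gamma}} = \begin{pmatrix} 0 \\ \underline{\beta} \end{pmatrix}, \qquad \underset{(m+k) \times (m+k)}{\underline{H}_\gamma} = \begin{pmatrix} \underline{h}_\alpha h I_m & 0 \\ 0 & \underline{H}_\beta \end{pmatrix},\]

\[\underset{T \times T}{Q(\tilde{s})} = Q = \mathrm{diag}(h_{\tilde{s}_1}, \ldots, h_{\tilde{s}_T}).\]

With this notation, (6.35) is equivalent to 

\[p(y \mid \gamma, h, \pi, \mathbf{h}, \tilde{s}, X) = (2\pi)^{-T/2} h^{T/2} |Q|^{1/2} \tag{6.39}\]
\[\cdot\ \exp[-h(y - W\gamma)'Q(y - W\gamma)/2]. \tag{6.40}\]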


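The kernel of the conditional posterior density of \(\gamma\) is the product of (6.31), 
(6.38), and (6.40), from which the conditional posterior distribution is 

\[\gamma \mid (h, \mathbf{h}, \tilde{s}, y^o, X, A) \sim N(\overline{\gamma}, \overline{H}_\gamma^{-1});\]
\[\overline{H}_\gamma = \underline{H}_\gamma + hW'QW, \quad \overline{\gamma} = \overline{H}_\gamma^{-1}[\underline{H}_\gamma \underline{\gamma} + hW'Qy^o]. \tag{6.41}\]

The conditional posterior density of \(h\) is the product of (6.32), (6.38), and (6.35). 
This kernel corresponds to the conditional posterior distribution 

\[\Big[\underline{s}^2 + \underline{h}_\alpha \alpha'\alpha + \sum_{t=1}^T h_{\tilde{s}_t}(y_t - \alpha_{\tilde{s}_t} - \beta'x_t)^2\Big] h \mid (\gamma, \mathbf{h}, \tilde{s}, y^o, X, A) \sim \chi^2(\underline{\nu} + m + T). \tag{6.42}\]

The conditional posterior density kernel of \(\pi\) is the product of (6.34) and (6.36), 
\(\prod_{j=1}^m \pi_j^{r + T_j - 1}\), and thus the conditional posterior distribution is Dirichlet with 
parameters \(r + T_j\) (j = 1, ..., m). 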
The conditional posterior density kernel of \(\mathbf{h}\) is the product of (6.37) and (6.35), 
which implies 

\[\Big[\underline{\nu}_* + h \sum_{t=1}^T \delta(\tilde{s}_t, j)(y_t - \alpha_j - \beta'x_t)^2\Big] h_j \mid (\gamma, h, \tilde{s}, y^o, X, A) \sim \chi^2(\underline{\nu}_* + T_j) \quad (j = 1, \ldots, m). \tag{6.43}\]

The conditional posterior density kernel for the state assignments \(\tilde{s}\) is the product 
of (6.34) and (6.35) taken over t = 1, ..., T. Thus the states \(\tilde{s}_t\) are conditionally 
independent, with 

\[P(\tilde{s}_t = j \mid \gamma, h, \pi, \mathbf{h}, y^o, X, A) \propto \pi_j h_j^{1/2} \exp[-h \cdot h_j (y_t - \alpha_j - \beta'x_t)^2/2] \quad (j = 1, \ldots, m). \tag{6.44}\]


Draws from these independent finite state distributions are straightforward. 
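
The last three blocks can be sketched in a few lines. The following is a minimal Python illustration, not the BACC implementation: it draws the states from (6.44), the state probabilities from their Dirichlet conditional, and the state precisions from (6.43). The function names, the argument `resid` (current values of $y_t - \beta' x_t$), and the reading of $T_j$ as the number of observations currently assigned to state $j$ are assumptions made for this example.

```python
import math
import random


def draw_states(resid, alpha, h, h_vec, pi, rng):
    """Draw each state independently from the finite distribution (6.44)."""
    m = len(pi)
    states = []
    for e in resid:
        # Log kernel: log pi_j + (1/2) log h_j - h * h_j * (e - alpha_j)^2 / 2
        log_w = [math.log(pi[j]) + 0.5 * math.log(h_vec[j])
                 - 0.5 * h * h_vec[j] * (e - alpha[j]) ** 2 for j in range(m)]
        top = max(log_w)  # subtract the maximum to avoid underflow
        weights = [math.exp(v - top) for v in log_w]
        states.append(rng.choices(range(m), weights=weights)[0])
    return states


def draw_pi(states, m, r, rng):
    """Draw pi from Dirichlet(r + T_1, ..., r + T_m) via normalized gammas."""
    counts = [sum(1 for s in states if s == j) for j in range(m)]
    gammas = [rng.gammavariate(r + c, 1.0) for c in counts]
    total = sum(gammas)
    return [g / total for g in gammas]


def draw_h_vec(resid, states, alpha, h, nu_star, rng):
    """Draw each h_j from (6.43): [nu* + h * SS_j] h_j ~ chi-square(nu* + T_j)."""
    draws = []
    for j in range(len(alpha)):
        members = [e for e, s in zip(resid, states) if s == j]
        ss = sum((e - alpha[j]) ** 2 for e in members)
        # A chi-square(nu) variate is Gamma(shape=nu/2, scale=2).
        chi2 = rng.gammavariate((nu_star + len(members)) / 2.0, 2.0)
        draws.append(chi2 / (nu_star + h * ss))
    return draws
```

Working on the log scale in `draw_states` is a design choice: with large precisions the unnormalized weights in (6.44) can underflow, and subtracting the largest log weight leaves the probabilities unchanged.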

BACC incorporates the normal mixture linear model using the conditionally 
conjugate prior distribution and posterior simulation algorithm described in this 
section. The output of the posterior simulator will, in general, reflect some switching 
between labels assigned to states. This poses no complications for problems in 
which functions of interest depend on the parameters $\alpha$, $\mathbf{h}$, and $\pi$ only through
the pdf of $y_t - \beta' x_t$, as is always the case when the mixture of normals model is
introduced solely to provide a flexible representation of this distribution. That is 
the case in the following example. 


Example 6.4.1 A Normal Mixture Linear Model for Earnings (The online 
appendix contains data, annotated code, and output for this example.) There is a 
long and well-established literature that studies the relationship between earnings 
and the determinants of earnings suggested by lifecycle human capital models. 
Going back at least to the work of Mincer (1958), the essence of these models 
is that an individual’s productivity, or human capital, is an increasing function of 
formal education and work experience. By far the most common measure of formal 
education is years of schooling, and the most common measure of experience is 
age. The Panel Study of Income Dynamics (PSID) is a household-based panel that
has collected information on earnings and other aspects of economic activity. This 
example uses data collected in 1993 on the ages, levels of education, and earnings 
of 2698 white men between the ages of 25 and 65 who had earnings of at least 
$1000. It focuses on the distribution of earnings conditional on age and education. 

Economic theory provides little guidance on the functional form of this conditional
distribution. Consistent with much of the human capital literature, the
expectation of the logarithm of earnings ($y_t$) is assumed to be a polynomial function
of age ($a_t$) and education ($e_t$), including all terms up to order 4 in age and 2 in
education; thus, $E(y_t \mid a_t, e_t, A) = \sum_{i=0}^{4} \sum_{j=0}^{2} \beta_{ij} a_t^i e_t^j$. If the polynomial terms are
organized in a $15 \times 1$ vector $x_t$ and the coefficients are arranged correspondingly
in a $15 \times 1$ vector $\beta$, then we may write $E(y_t \mid x_t, \beta, A) = \beta' x_t$, consistent with
the general notation adopted for the linear model in Example 2.1.2. Section 5.4
considers the question of specification of the order of the polynomial in the context
of nonlinear regression; see in particular Exercise 5.4.6. Examples 8.3.1 and
8.3.3 return to the issue of the adequacy of this formulation of the regression function.
The prior distribution derives from the assumption $\beta' x_t \mid A \sim N(\mu, \tau^2)$ $(t = 1, \ldots, T)$,
where $\mu = 10.5$, roughly the sample average of log earnings; $\tau^2 = T\sigma^2$,
where $T$ is sample size and $\sigma = 0.7$, roughly the sample standard deviation of log
earnings conditional on age and education. The prior distribution of the precision
parameter $h$ is $8h \mid A \sim \chi^2(10)$. The mode of this prior distribution is $h = 3$,
and the standard deviation is approximately 1.7. This completes the specification of a normal
linear model, which is useful as a comparison benchmark with the mixture
models.

The mixture models require the specification of the prior hyperparameters $m$,
$\underline{h}_\alpha$, $\underline{\nu}_*$, and $\underline{r}$. We consider mixtures of two ($m = 2$) and three ($m = 3$) normal
distributions, and in each case take $\underline{h}_\alpha = 0.4$, $\underline{\nu}_* = 3$, and $\underline{r} = 1$. This allows enough
spread in the distributions to ensure a small but nonnegligible probability of
multimodality in the density, while placing substantial prior probability on large ratios
of component variances $h_i/h_j$. In this and other complex models, prior distributions
can best be understood through their implications for the relevant functions
of interest. Example 8.3.1 pursues this method in detail for this application.

The normal model and 2 mixture models have almost identical implications for 
the posterior regression function E(y | a,e, A). Figure 6.2a provides one repre- 
sentation, for the mixture of three normal distributions. At all levels of education, 
expected log earnings is a concave function of age, with a peak at age 50 for 
college graduates (e = 16) and the early 50s for those who did not graduate from 
high school (e < 12). On the other hand, expected log earnings do not drop as 
rapidly after their peak for college graduates as they do for high school graduates, 
but rise more rapidly for young men. Returns to education, measured as the differ- 
ence between expected log earnings for college and high school graduates, steadily 
increase with age. 

The data strongly favor the mixture models as compared with a normal model. 
The log marginal likelihood for the latter model is -3056.1, whereas it is -2762.0
for the mixture of two normals and -2757.2 for the mixture of three normals. The
remaining panels of Figure 6.2 provide more detail for the conditional distributions
and some insight into the marginal likelihood values. Panel (b) provides the posterior
expectation of the pdf of the residual term $y - \beta' x$ in the normal model. The
darker line in panel (c) does the same for the mixture of two normals model and 
in (d), for the mixture of three normals model. The corresponding lighter line in 
each of these panels provides the posterior expectation of a normal density with 
the same mean and variance as in the mixture of normals distribution. The striking 
similarity of the normal densities in panels (b), (c), and (d) can be interpreted as 
a consequence of Theorem 3.4.2 and Example 3.4.3; the pseudotrue normal model 
will have the mean and variance of the assumed data generating process D. [For 
a variant on this approach that uses all three models simultaneously in a single 
MCMC algorithm, see Richardson and Green (1997).] 

The normal mixture distributions in panels (c) and (d) have mean zero. They 
are strongly negatively skewed, as indicated by the thicker left tail and the positive 
mode. They are strongly leptokurtic, as indicated by a modal value substantially 
higher than that of the normal distribution, as well as the thicker tails. The mixture 
of normals densities in panels (c) and (d) are strikingly similar. The model and 
data do little to exploit the additional flexibility provided by a third component to 
modify the two-component density. Nevertheless, the sample of 2698 observations 
provides decisive evidence in favor of the mixture of three normals model over the 
mixture of two normals model, with a Bayes factor of 121.5. 
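
The Bayes factor quoted here is the exponential of the difference between the two log marginal likelihoods. A small Python check using the values reported in this example (the function and variable names are illustrative):

```python
import math

# Log marginal likelihoods reported in the text for Example 6.4.1.
LOG_ML = {"normal": -3056.1, "mixture2": -2762.0, "mixture3": -2757.2}


def bayes_factor(log_ml_a, log_ml_b):
    """Bayes factor in favor of model a over model b."""
    return math.exp(log_ml_a - log_ml_b)


# Three-component versus two-component mixture: exp(4.8).
bf_3_vs_2 = bayes_factor(LOG_ML["mixture3"], LOG_ML["mixture2"])

# Two-component mixture versus normal, kept on the log scale because the
# Bayes factor itself is astronomically large.
log_bf_2_vs_normal = LOG_ML["mixture2"] - LOG_ML["normal"]
```

The first quantity reproduces the reported value of about 121.5; the second is 294.1 on the log scale.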
Figure 6.2. Some aspects of the posterior distribution of log earnings conditional on age and education:
(a) expected log earnings conditional on age and education; disturbance pdf assuming normal distribution 
(b), mixture of two normals (c), mixture of three normals (d). 


Distributions can be summarized in many ways. The coefficients of skewness 
and kurtosis are the most common representations of the third and fourth moments. 
In any given application there may be more substantive summaries, as well. In this 
example, the conditional distributions indicate the magnitude of inequality in earn- 
ings, given age and education. A widely used measure of inequality is the Gini coef- 
ficient, which derives from the Lorenz curve. The Lorenz curve L(p), defined on 
the unit interval, is the fraction of total earnings accruing to individuals in earnings 
quantile $p$ or lower. If all individuals have the same earnings, then $L(p) = p$, and
in general $L(p) \le p$. The Gini coefficient is $G = 2 \int_0^1 [p - L(p)]\,dp$; $G \in [0, 1]$,
with $G = 0$ if and only if all individuals have the same earnings and $G = 1$ if
and only if all earnings accrue to one individual. We consider two other measures
of inequality: $P$, the fraction of men with earnings less than one-half of median
earnings; and $R$, the fraction of earnings accruing to men in the top decile of the
earnings distribution.
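
To make the three measures concrete, the sketch below computes them from a finite sample of earnings levels, for instance simulated draws of $\exp(y)$ at fixed age and education. This is only an illustration under stated assumptions: the example in the text works with the model's conditional distribution rather than a raw sample, the function names are hypothetical, the Lorenz curve is interpolated linearly between sample points, and the top decile is taken to be the largest $\lceil n/10 \rceil$ observations.

```python
import math
import statistics


def gini(earnings):
    """G = 2 * integral of [p - L(p)] dp, with L(p) interpolated linearly."""
    x = sorted(earnings)
    n, total = len(x), sum(x)
    area, prev, running = 0.0, 0.0, 0.0
    for value in x:
        running += value
        cur = running / total           # Lorenz curve at p = i / n
        area += (prev + cur) / (2 * n)  # trapezoid under L(p) on one step
        prev = cur
    return 1.0 - 2.0 * area


def low_earnings_share(earnings):
    """P: fraction of individuals earning less than half the median."""
    cutoff = 0.5 * statistics.median(earnings)
    return sum(1 for v in earnings if v < cutoff) / len(earnings)


def top_decile_share(earnings):
    """R: share of total earnings received by the top tenth of earners."""
    x = sorted(earnings, reverse=True)
    k = math.ceil(len(x) / 10)  # assumption: round the decile size up
    return sum(x[:k]) / sum(x)
```

Applied to each posterior draw of the conditional earnings distribution, functions of this kind yield posterior draws of $G$, $P$, and $R$, from which medians and interquartile ranges can be read off.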

The distribution of $y - \beta' x$ in the mixture of three normals model is more
negatively skewed and more leptokurtic than in the mixture of two normals model, 
but only slightly so. By contrast, the coefficient of skewness is 0 and the coefficient 
of kurtosis is 3 in a normal model. The measures of inequality are nearly identical 
in the two mixture models. These results reconfirm the fact that addition of a third 
component to a mixture of two normals distribution changes essentially nothing 
in this example. By contrast, all three measures of inequality are substantially 
higher in the normal model. Panels (c) and (d) of Figure 6.2 indicate the reason 
why. The assumed normal distribution, in capturing the mean and variance of 
the mixture of normals distribution, shifts mass away from the mean, rather than 
toward the mean, except in the extreme tails of the distribution. [See panels (c) 
and (d) of Figure 6.2.] The former effect dominates the latter in all three measures 
of inequality, whereas the converse is true in the coefficients of skewness and 
kurtosis. 


Measure                    Model         Median    Interquartile range
Coefficient of skewness    Mixture (2)   -1.063    (-1.527, -0.868)
                           Mixture (3)   -1.340    (-1.599, -0.981)
Coefficient of kurtosis    Mixture (2)    6.263    (5.871, 6.722)
                           Mixture (3)    6.529    (6.177, 6.896)
Gini coefficient G         Normal         0.398    (0.393, 0.404)
                           Mixture (2)    0.345    (0.337, 0.353)
                           Mixture (3)    0.345    (0.337, 0.355)
Low earnings P             Normal         0.174    (0.168, 0.179)
                           Mixture (2)    0.142    (0.137, 0.148)
                           Mixture (3)    0.150    (0.145, 0.156)
High earnings R            Normal         0.294    (0.290, 0.299)
                           Mixture (2)    0.264    (0.256, 0.272)
                           Mixture (3)    0.266    (0.256, 0.276)


6.4.3 Generalizing the Observable Outcomes 


Recall that in the censored linear model (Section 6.1) the continuous outcome
variable $\tilde{y}_t$ is latent (unobservable). The observable outcome is a set-valued function
$C_t = c_t(\tilde{y}_t)$, mapping each possible outcome into exactly one set (6.2) having
the property $\tilde{y}_t \in C_t$. The probit model (Section 6.2), the censored linear model,
and more general censoring of outcome variables (see Exercise 6.1.1) are special
cases. The fully observed linear model of Example 2.1.2 is the trivial special
case $C_t = \tilde{y}_t$.

This same strategy may be used to extend both the Student-t and normal mixture
linear models. The distribution of $\tilde{y}_t$ conditional on all parameters, latent variables
and $X$, but not $C = \{C_1, \ldots, C_T\}$, is

$$\tilde{y}_t \mid (\beta, h, \tilde{\mathbf{h}}, X, A) \sim N[\beta' x_t, (h \cdot \tilde{h}_t)^{-1}] \tag{6.45}$$

in the extension of the Student-t linear model, and

$$\tilde{y}_t \mid (\gamma, h, \mathbf{h}, \tilde{s}, X, A) \sim N[\gamma' w_t, (h \cdot h_{\tilde{s}_t})^{-1}] \tag{6.46}$$


in the extension of the normal mixture linear model. The probability distribution
of observables is given by (6.4), repeated here for reference:

$$p(C \mid \tilde{y}) = \prod_{t=1}^{T} p(C_t \mid \tilde{y}_t) = \prod_{t=1}^{T} I_{C_t}(\tilde{y}_t). \tag{6.47}$$


In posterior simulation the generalization introduced in Section 6.1 requires only
one additional step here, as it did there. In the case of the Student-t model, draw
from (6.45) subject to the constraint $\tilde{y}_t \in C_t^o$, and in the normal mixture model
draw from (6.46) subject to the same restriction.
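
When the observed set is an interval, the additional step is a draw from a truncated normal distribution. A minimal Python sketch using the inverse-cdf method follows; the function name and arguments are hypothetical, and the inverse-cdf approach is one simple choice that loses accuracy when the interval lies far in a tail of the untruncated distribution.

```python
import random
from statistics import NormalDist


def draw_constrained(mean, precision, lower, upper, rng):
    """Draw from N(mean, 1/precision) restricted to the interval (lower, upper)."""
    dist = NormalDist(mu=mean, sigma=precision ** -0.5)
    p_lo, p_hi = dist.cdf(lower), dist.cdf(upper)
    u = p_lo + (p_hi - p_lo) * rng.random()
    # Keep u strictly inside (0, 1), as inv_cdf requires.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return dist.inv_cdf(u)
```

In a probit-type setting, for example, an observation recorded as a positive outcome would call `draw_constrained(mean, precision, 0.0, float("inf"), rng)`, with `mean` and `precision` taken from (6.45) or (6.46) at the current parameter values.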

This generalization leads to a wide class of models, many on the frontier of 
current econometric research; for further discussion, see Section 5 of Geweke and 
Keane (2001), and for applications see Chib and Greenberg (1995) and Rossi et al. 
(2001). All of the variants discussed in this section are incorporated in BACC. 


Exercise 6.4.1 Properties of the Normal Mixture Linear Model Consider the
disturbance term, $\varepsilon_t = y_t - \beta' x_t$, in this model.


(a) Show that if $\alpha_1 = \cdots = \alpha_m = 0$, then the distribution of the disturbance is
symmetric, unimodal, and leptokurtic.
(b) Show that if $h_1 = \cdots = h_m$, then the distribution of the disturbance can be
either leptokurtic or platykurtic.
(c) Show that if $X$ is any random variable, there exists a sequence of random
variables $X_n$, each with a normal mixture distribution, such that $X_n \xrightarrow{d} X$.


Exercise 6.4.2 Ergodicity Consider the posterior simulation algorithm for the 
normal mixture linear model presented in this section. 


(a) Show that the Markov chain is ergodic. 


(b) Is the Markov chain uniformly ergodic? If so, indicate why. If it is difficult to 
demonstrate uniform ergodicity, try to develop a modified algorithm that is 
uniformly ergodic using the methods described in Section 4.6.1. [Diebolt and 
Robert (1994) investigate ergodicity problems for this and similar algorithms 
in mixture models. ] 
Exercise 6.4.3 Interval Data A historian is using Army recruiting records from
the American Revolutionary War to learn about the distribution of height ($y_{1t}$) and
weight ($y_{2t}$) among young men in the late 1700s in colonial America. She has the
following information.


(a) For $n_1$ individuals who were accepted as Army recruits, she knows both
height and weight.

(b) For $n_2$ individuals who were accepted as Army recruits, she knows height
but not weight.

(c) For $n_3$ individuals who were accepted as Army recruits, she has weight but
not height.

(d) She knows that there were $n_4$ individuals accepted as Army recruits, but for
these individuals she knows neither height nor weight.

(e) She knows that $n_5$ individuals were rejected as recruits because they failed
to meet height and weight standards.

(f) She knows that the height standard was $c_{11} < y_{1t} < c_{12}$ and the weight
standard was $c_{21} < y_{2t} < c_{22}$; she knows all four $c_{ij}$ values.


The historian is willing to make the following assumptions: 

1. The population from which recruits were either accepted or rejected is a 
random sample of young men in colonial America. 

2. The only reason for rejection was failure to meet height and weight 
standards. 


3. All missing data are missing completely at random (recall Example 2.2.3). 


4. For $y_t = (y_{1t}, y_{2t})'$, $y_t \sim N(\mu, H^{-1})$.

5. The prior distributions of $\mu$ and $H$ are independent, the prior for $\mu$ is
normal, and the prior for $H$ is Wishart.

The historian’s immediate objective is to construct a posterior simulator
for $\mu$ and $H$. Show how to do this, using a Markov chain Monte Carlo
algorithm.


Exercise 6.4.4 Class Size and Test Scores Revisited Example 5.1.1 assumed 
that the distribution of test scores conditional on covariates is normal. Consider 
two alternatives: that this distribution is i.i.d. Student-t, and that it is i.i.d. normal
mixture with two components. 


(a) For each alternative, set up conditionally conjugate prior distributions that 
are comparable to those in Example 5.1.1. In each case, defend your choices 
of prior distributions for the additional parameters. 

(b) For each alternative distribution, compute the Bayes factor in favor of that 
alternative, versus the original specification. 

(c) Regardless of the Bayes factor in (b), work through the decision prob- 
lems of Example 5.1.2 under each alternative distribution. Are the results 
significantly affected? Why or why not? 
Exercise 6.4.5 Outliers The problem of “outliers” in regression is a conventional
topic in many regression courses. One approach to outliers is to assume that

$$y_t \mid (\beta, h, h^*, A) \sim N(\beta' x_t, h^{-1}) \tag{6.48}$$

if $y_t$ is not an outlier and

$$y_t \mid (\beta, h, h^*, A) \sim N(\beta' x_t, h^{*-1}) \tag{6.49}$$

if $y_t$ is an outlier, together with the idea that $h^* \ll h$. Outliers typically constitute
only a small fraction of an entire sample.


(a) Suppose that we knew which observations were outliers and which were
not. State the conditional posterior distribution for $\beta$. Show that for given
$h$, as $h^* \to 0$, the posterior distribution effectively ignores the outlier
observations.

(b) For the rest of this problem, assume that we do not know which observations
are outliers and which are not. Carefully state a complete mixture of normals
linear model that incorporates all the features of outliers stated at the start
of this problem. Be as specific as possible about parameters in the prior
distribution that would best incorporate these features.

(c) Briefly describe a Markov chain Monte Carlo algorithm for the posterior
distribution in (b). Show how the algorithm yields, as a byproduct, the
probability that each observation is an outlier.

(d) Suppose that (6.48)-(6.49) is, indeed, the data-generating process. Suppose
without loss of generality that $h = 1$. Sample size $T$ is fixed. Show that as
$h^* \to 0$, the algorithm in (c) will correctly classify each observation as an
outlier or not.

[For more on this approach to outliers, see Chaloner and Brant (1988) and 
Smith and Kohn (1996).] 


Exercise 6.4.6 Specification of the Regression Function in Example 6.4.1 
Repeat the analysis in Exercise 5.4.6, but assuming a normal mixture linear model. 


Exercise 6.4.7 The Earnings Example Extended This exercise extends 
Example 6.4.1. 


(a) The example focused on five properties of the conditional distribution: 
skewness and kurtosis, and the measures of inequality G, P, and R. The 
example also showed that earnings vary systematically with age and educa- 
tion. Assuming that the relevant distribution of age and education is given 
by the sample of 2698 men used in the example, find posterior medians and 
interquartile ranges for the unconditional distribution of log earnings. 


(b) What is the probability that a 40-year-old man with 16 years of education
has higher earnings than a 40-year-old man with 12 years of education?

(c) The example reported medians and interquartile ranges for skewness and
kurtosis. Try to compute posterior means and variances for skewness and
kurtosis, identify the problem that results, and attempt to rectify it. (Hint:
The prior distribution guarantees the existence of some, but not all, posterior
moments.)


CHAPTER 7


Modeling for Time Series 


In many decisionmaking problems the vector of interest is future (and therefore
unobserved) values of time series. Section 1.1.2 introduced one such problem,
assessing value at risk. The common structure of all of these problems is that the
distribution of the vector of interest $\omega = (y_{T+1}', \ldots, y_F')'$ is inherent in the model’s
specification of $p(y_t \mid Y_{t-1}, \theta_A, A)$ $(t = 1, 2, \ldots)$:

$$p(\omega \mid Y_T^o, \theta_A, A) = \prod_{t=T+1}^{F} p(y_t \mid Y_{t-1}, \theta_A, A). \tag{7.1}$$

Then, as always,

$$p(\omega \mid Y_T^o, A) = \int_{\Theta_A} p(\omega \mid Y_T^o, \theta_A, A)\, p(\theta_A \mid Y_T^o, A)\, d\nu(\theta_A). \tag{7.2}$$

The central technical problem is construction of the posterior simulator $\theta_A^{(m)} \sim
p(\theta_A \mid Y_T^o, A)$. Given this, draws from $p(\omega \mid Y_T^o, A)$ require only the forward simulation
of the model evident in (7.1). Because of the consistent conditioning in
(7.1)-(7.2), uncertainty about parameters or other unobservables $\theta_A$ and uncertainty
about the future conditional on $\theta_A$ are integrated in seamless fashion. For
further development of this idea and comparison with other methods, see Geweke 
and Whiteman (in press). 
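
The forward-simulation step can be sketched as follows. This is a minimal Python illustration under a strong simplifying assumption made only for the example: the model is a zero-mean first-order autoregression $y_t = \phi y_{t-1} + u_t$ with $u_t \sim N(0, h^{-1})$, so that each posterior draw of the unobservables is a pair $(\phi, h)$. The posterior draws are taken as given, and the function name is hypothetical.

```python
import random


def simulate_future(theta_draws, y_last, horizon, rng):
    """For each posterior draw (phi, h), simulate y_{T+1}, ..., y_{T+horizon}."""
    paths = []
    for phi, h in theta_draws:
        y, path = y_last, []
        for _ in range(horizon):
            # One-step-ahead draw from p(y_t | Y_{t-1}, theta, A), as in (7.1).
            y = phi * y + rng.gauss(0.0, h ** -0.5)
            path.append(y)
        paths.append(path)
    return paths
```

Because each path uses a different posterior draw of $(\phi, h)$, the collection of paths reflects both parameter uncertainty and uncertainty about the future given the parameters, which is the content of (7.2).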

As emphasized in Chapter 1, this conditioning is congruent with the circumstances
of the decisionmaker, who must proceed on the basis of information $Y_T^o$
and $A$ that is available. This fact, combined with the development of posterior
simulators that make (7.2) practical, has led to vigorous growth in Bayesian mod- 
eling for time series and the application of these models in forecasting, portfolio 
allocation, and other decisionmaking contexts. Geweke and Whiteman (in press) 
review this literature. This chapter provides technical detail for three time series 
models, each illustrating a different significant set of the tools that have proved 
useful in this endeavor. Section 7.1 develops Bayesian methods for autoregressions
using the exact likelihood function for the stationary case. Section 7.2 turns to the 
first-order Markov model, which is the most commonly applied model for discrete 
time series. Section 7.3 uses this model in combination with latent variables and 
normal distributions to construct a leading simple yet general model for conditional 
dependence in time series. 


7.1 LINEAR MODELS WITH SERIAL CORRELATION 


Suppose that in the linear regression model introduced in Example 2.1.2 and used 
throughout Chapter 2 the covariates and dependent variable are time series, each 
measured at a point in time or as averages over successive intervals. If in continuous 
time these variables move smoothly without jumps, then as the sampling interval 
becomes shorter and shorter, the assumption that the disturbances $\varepsilon_t = y_t - \beta' x_t$
are mutually independent becomes untenable. 

This section takes up a modification of this model that weakens the assumption
of independence, replacing it with the assumption that $\varepsilon_t$ obeys a stationary
autoregressive process of order $p$. This modification specifies


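$$y_t = \beta' x_t + \varepsilon_t, \tag{7.3}$$

$$\varepsilon_t = \sum_{s=1}^{p} \phi_s \varepsilon_{t-s} + u_t, \tag{7.4}$$

$$u_t \mid (\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots) \stackrel{iid}{\sim} N(0, h^{-1}), \tag{7.5}$$

for all periods $t = 1, \ldots, T$. Moreover, $\varepsilon_t$ is stationary; that is, for any set of
$s_1, \ldots, s_m$, the distribution of the vector $(\varepsilon_t, \varepsilon_{t-s_1}, \ldots, \varepsilon_{t-s_m})$ does not depend
on $t$. The assumption of stationarity is important in addressing the complication
introduced by the fact that the observables are $(y_1, x_1), \ldots, (y_T, x_T)$, and (7.4)
introduces $\varepsilon_{-p+1}, \ldots, \varepsilon_0$ in addition to $\beta$, $h$, and $\phi = (\phi_1, \ldots, \phi_p)'$. A necessary
condition for stationarity is $\phi \in S_p \subset \mathbb{R}^p$, where

$$S_p = \left\{ \phi : 1 - \sum_{s=1}^{p} \phi_s z^s \neq 0 \;\; \forall z : |z| \le 1 \right\}$$

and $z$ is complex.

Motivated by (7.4), define

$$y_t^* = y_t - \sum_{s=1}^{p} \phi_s y_{t-s} \quad \text{and} \quad x_t^* = x_t - \sum_{s=1}^{p} \phi_s x_{t-s} \quad (t = p+1, \ldots, T),$$

and take $y^* = (y_{p+1}^*, \ldots, y_T^*)'$ and $X^* = [x_{p+1}^*, \ldots, x_T^*]'$. Then

$$y^* \mid (\beta, \phi, h, X, A) \sim N(X^* \beta, h^{-1} I_{T-p}).$$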
Let $y_p$ denote the first $p$ elements of $y$ and $X_p$ the first $p$ rows of $X$. Then

$$y_p \mid (\beta, \phi, h, X, A) \sim N[X_p \beta, h^{-1} V_p(\phi)]. \tag{7.6}$$

The $p \times p$ matrix $V_p(\phi)$ will be derived shortly. From (7.5), $y^*$ and $y_p$ are independent
conditional on $(\beta, \phi, h, X)$, and the Jacobian of the one-to-one transformation
between $y'$ and $(y_p', y^{*\prime})$ is one. Hence

$$p(y \mid \beta, \phi, h, X, A) = (2\pi)^{-T/2} h^{T/2} |V_p(\phi)|^{-1/2} \cdot \exp\{-h[(y^* - X^*\beta)'(y^* - X^*\beta) + (y_p - X_p\beta)' V_p(\phi)^{-1} (y_p - X_p\beta)]/2\}. \tag{7.7}$$


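An alternative expression for the observables density emphasizing the role of $\phi$
begins with $\varepsilon_t = y_t - \beta' x_t$ from (7.3). Define

$$\varepsilon^* = \begin{pmatrix} \varepsilon_{p+1} \\ \varepsilon_{p+2} \\ \vdots \\ \varepsilon_T \end{pmatrix} \quad \text{and} \quad E^* = \begin{bmatrix} \varepsilon_p & \cdots & \varepsilon_1 \\ \varepsilon_{p+1} & \cdots & \varepsilon_2 \\ \vdots & & \vdots \\ \varepsilon_{T-1} & \cdots & \varepsilon_{T-p} \end{bmatrix}. \tag{7.8}$$

Then (7.7) becomes

$$p(y \mid \beta, \phi, h, X, A) = (2\pi)^{-T/2} h^{T/2} \exp[-h(\varepsilon^* - E^*\phi)'(\varepsilon^* - E^*\phi)/2] \tag{7.9a}$$

$$\cdot\, |V_p(\phi)|^{-1/2} \exp[-h(y_p - X_p\beta)' V_p(\phi)^{-1} (y_p - X_p\beta)/2]. \tag{7.9b}$$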


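Because $\varepsilon_t$ is stationary, $\mathrm{cov}(\varepsilon_i, \varepsilon_j)$ depends only on $|i - j|$, and so the $(i, j)$
entry of $V_p(\phi)$ may be expressed $v_{|i-j|}$. From (7.4)

$$h^{-1} v_j = \mathrm{cov}(\varepsilon_t, \varepsilon_{t-j} \mid \phi, h, A)$$

$$= \sum_{s=1}^{p} \phi_s\, \mathrm{cov}(\varepsilon_{t-s}, \varepsilon_{t-j} \mid \phi, h, A) + \mathrm{cov}(u_t, \varepsilon_{t-j} \mid \phi, h, A)$$

$$= h^{-1} \left[ \sum_{s=1}^{p} \phi_s v_{|j-s|} + \delta(j, 0) \right] \tag{7.10}$$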


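for all $j \ge 0$. Evaluating (7.10) for $j = 1, \ldots, p$ leads to the $p$ Yule-Walker
equations

$$\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_p \end{pmatrix} = \begin{bmatrix} v_0 & v_1 & \cdots & v_{p-1} \\ v_1 & v_0 & \cdots & v_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ v_{p-1} & v_{p-2} & \cdots & v_0 \end{bmatrix} \begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{pmatrix}. \tag{7.11}$$

If $\phi \in S_p$, then (7.11) determines $v_0, v_1, \ldots, v_{p-1}$ up to a scaling factor, and the
$p \times p$ matrix in (7.11) will be positive definite if $v_0 > 0$ and negative definite if
$v_0 < 0$. Evaluating (7.10) for $j = 0$ yields

$$v_0 = \sum_{s=1}^{p} \phi_s v_s + 1, \tag{7.12}$$

which determines the scale factor; $\phi \in S_p$ implies $v_0 > 0$.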
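
For a concrete check, take $p = 1$. The following short derivation uses only (7.10), evaluated at $j = 1$ and $j = 0$:

```latex
\begin{aligned}
j = 1:\quad & v_1 = \phi_1 v_0, \\
j = 0:\quad & v_0 = \phi_1 v_1 + 1 = \phi_1^2 v_0 + 1
   \;\Longrightarrow\; v_0 = \frac{1}{1 - \phi_1^2}, \\
& V_1(\phi) = \frac{1}{1 - \phi_1^2}, \qquad
  \operatorname{var}(\varepsilon_t \mid \phi, h, A) = \frac{h^{-1}}{1 - \phi_1^2}.
\end{aligned}
```

In this case $S_1$ is the interval $|\phi_1| < 1$, on which $v_0 > 0$, in agreement with the general statement.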

The posterior simulator in this model is similar to that first proposed by Chib 
(1993). [For an extension to autoregressive-moving average models, see Chib and 
Greenberg (1994), and for extensions involving missing data, see Barnett et al. 
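(1996).] The kernel of (7.7) in $\beta$ indicates that the conditionally conjugate prior
distribution of $\beta$ is normal, $\beta \sim N(\underline{\beta}, \underline{H}_\beta^{-1})$:

$$p(\beta \mid A) = (2\pi)^{-k/2} |\underline{H}_\beta|^{1/2} \exp[-(\beta - \underline{\beta})' \underline{H}_\beta (\beta - \underline{\beta})/2]. \tag{7.13}$$

That of $h$ is a gamma distribution, $\underline{s}^2 h \sim \chi^2(\underline{\nu})$:

$$p(h \mid A) = [2^{\underline{\nu}/2}\, \Gamma(\underline{\nu}/2)]^{-1} (\underline{s}^2)^{\underline{\nu}/2} h^{(\underline{\nu}-2)/2} \exp(-\underline{s}^2 h/2). \tag{7.14}$$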


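Examining (7.9a)-(7.9b), it is evident from the presence of $V_p(\phi)$ in (7.9b) that a
conditionally conjugate prior distribution for $\phi$ would involve the awkward functional
forms $|V_p(\phi)|$ and $V_p(\phi)^{-1}$. On the other hand, the kernel of (7.9a) in $\phi$ is
normal. This suggests a prior distribution $\phi \sim N(\underline{\phi}, \underline{H}_\phi^{-1})$ truncated to the set $S_p$,

$$p(\phi \mid A) = (2\pi)^{-p/2} D(\underline{\phi}, \underline{H}_\phi) |\underline{H}_\phi|^{1/2} \cdot \exp[-(\phi - \underline{\phi})' \underline{H}_\phi (\phi - \underline{\phi})/2]\, I_{S_p}(\phi), \tag{7.15}$$

where

$$D(\underline{\phi}, \underline{H}_\phi)^{-1} = (2\pi)^{-p/2} |\underline{H}_\phi|^{1/2} \int_{S_p} \exp[-(\phi - \underline{\phi})' \underline{H}_\phi (\phi - \underline{\phi})/2]\, d\phi.$$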
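
Working with a distribution truncated to $S_p$ requires a test of whether a given $\phi$ lies in $S_p$. One standard way to do this without computing polynomial roots is the step-down (inverse Levinson-Durbin) recursion: an autoregression is stationary exactly when every implied partial autocorrelation lies strictly inside $(-1, 1)$. The Python sketch below uses that technique as a substitute for a root calculation; it is not taken from the text, and the function name is hypothetical.

```python
def is_stationary(phi):
    """Return True if the AR coefficients (phi_1, ..., phi_p) lie in S_p."""
    coeffs = list(phi)
    while coeffs:
        k = coeffs[-1]  # partial autocorrelation at the current order
        if abs(k) >= 1.0:
            return False
        p = len(coeffs)
        # Step down from order p to order p - 1:
        # phi_j^(p-1) = (phi_j^(p) + k * phi_(p-j)^(p)) / (1 - k^2)
        coeffs = [(coeffs[j] + k * coeffs[p - 2 - j]) / (1.0 - k * k)
                  for j in range(p - 1)]
    return True
```

For $p = 1$ the test reduces to $|\phi_1| < 1$, and for $p = 2$ it reproduces the familiar triangle $\phi_1 + \phi_2 < 1$, $\phi_2 - \phi_1 < 1$, $|\phi_2| < 1$.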


A Gibbs sampling algorithm with a Metropolis step can simulate the unobserv- 
ables in this complete model. The posterior density kernel is the product of the 
prior densities (7.13), (7.14), (7.15), and the observables density expressed in either 
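of the forms (7.7) or (7.9a)-(7.9b). The conditional posterior density kernel of $\beta$,
from (7.7) and (7.13), is

$$p(\beta \mid h, \phi, y^o, X, A) \propto \exp\{-[(\beta - \underline{\beta})' \underline{H}_\beta (\beta - \underline{\beta}) + h(y^{*o} - X^*\beta)'(y^{*o} - X^*\beta) + h(y_p^o - X_p\beta)' V_p(\phi)^{-1} (y_p^o - X_p\beta)]/2\}.$$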
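Hence

$$\beta \mid (h, \phi, y^o, X, A) \sim N(\bar{\beta}, \bar{H}_\beta^{-1});$$

$$\bar{H}_\beta = \underline{H}_\beta + h X^{*\prime} X^* + h X_p' V_p(\phi)^{-1} X_p,$$

$$\bar{\beta} = \bar{H}_\beta^{-1} [\underline{H}_\beta \underline{\beta} + h X^{*\prime} y^{*o} + h X_p' V_p(\phi)^{-1} y_p^o].$$

Similarly, the posterior density kernel for $h$, from (7.7) and (7.14), shows

$$\bar{s}^2 h \mid (\beta, \phi, y^o, X, A) \sim \chi^2(\bar{\nu}); \qquad \bar{\nu} = \underline{\nu} + T,$$

$$\bar{s}^2 = \underline{s}^2 + (y^{*o} - X^*\beta)'(y^{*o} - X^*\beta) + (y_p^o - X_p\beta)' V_p(\phi)^{-1} (y_p^o - X_p\beta).$$

From (7.9a)-(7.9b) and (7.15) the conditional posterior density kernel of $\phi$ is

$$p(\phi \mid \beta, h, y^o, X, A) \propto \exp\{-[(\phi - \underline{\phi})' \underline{H}_\phi (\phi - \underline{\phi}) + h(\varepsilon^{*o} - E^{*o}\phi)'(\varepsilon^{*o} - E^{*o}\phi)]/2\} \tag{7.16a}$$

$$\cdot\, r(\beta, h, \phi)\, I_{S_p}(\phi), \tag{7.16b}$$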


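where $\varepsilon^{*o}$ and $E^{*o}$ are defined by substituting $\varepsilon_t^o = y_t^o - \beta' x_t$ for $\varepsilon_t$ in (7.8)
$(t = 1, \ldots, T)$ and $r(\beta, h, \phi)$ is expression (7.9b) after substituting $y^o$ for $y$. The
distribution corresponding to the kernel of (7.16a) in $\phi$ is

$$\phi \mid (\beta, h, y^o, X, A) \sim N(\bar{\phi}, \bar{H}_\phi^{-1}), \tag{7.17}$$

where

$$\bar{H}_\phi = \underline{H}_\phi + h E^{*o\prime} E^{*o}, \qquad \bar{\phi} = \bar{H}_\phi^{-1} (\underline{H}_\phi \underline{\phi} + h E^{*o\prime} \varepsilon^{*o}).$$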


At iteration $m$ a Metropolis within Gibbs step (see Section 4.6.2) for $\phi$ draws a
candidate $\phi^*$ from the distribution (7.17), using the current values $\beta^{(m)}$ of $\beta$ and
$h^{(m)}$ of $h$. From (7.16b) the acceptance probability for the candidate is

$$\min\left\{ \frac{r(\beta^{(m)}, h^{(m)}, \phi^*)\, I_{S_p}(\phi^*)}{r(\beta^{(m)}, h^{(m)}, \phi^{(m-1)})},\; 1 \right\}.$$
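
The acceptance step can be illustrated in the simplest case, $p = 1$. The Python sketch below assumes $V_1(\phi) = 1/(1 - \phi^2)$ and $S_1 = (-1, 1)$, so that (7.9b) reduces to $(1 - \phi^2)^{1/2} \exp[-h(1 - \phi^2)\varepsilon_1^2/2]$ with $\varepsilon_1 = y_1^o - \beta' x_1$. The candidate is taken as given (its generation from (7.17) is not shown), and the function names are hypothetical.

```python
import math
import random


def log_r(eps1, h, phi):
    """Log of (7.9b) for p = 1, where V_1(phi) = 1 / (1 - phi^2)."""
    return 0.5 * math.log(1.0 - phi * phi) - 0.5 * h * (1.0 - phi * phi) * eps1 ** 2


def metropolis_step(phi_current, phi_candidate, eps1, h, rng):
    """Accept or reject a candidate; phi_current must already lie in (-1, 1)."""
    if abs(phi_candidate) >= 1.0:
        return phi_current, 0.0  # the indicator of S_1 is zero
    log_ratio = log_r(eps1, h, phi_candidate) - log_r(eps1, h, phi_current)
    accept_prob = math.exp(min(0.0, log_ratio))
    if rng.random() < accept_prob:
        return phi_candidate, accept_prob
    return phi_current, accept_prob
```

Only the factor (7.9b) enters the ratio because the candidate distribution (7.17) already matches the kernel (7.16a), which therefore cancels.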


BACC incorporates the linear model with serial correlation using the conjugate 
prior distribution and posterior simulation algorithm described in this section. 


Exercise 7.1.1 A Linear Model with Serial Correlation and Missing Data 
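Suppose $y_t = \beta' x_t + \varepsilon_t$ $(t = 1, \ldots, T)$. The disturbance $\varepsilon_t$ is stationary and obeys
the first-order autoregression

$$\varepsilon_t = \rho \varepsilon_{t-1} + u_t, \qquad u_t \mid (\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots) \stackrel{iid}{\sim} N(0, h^{-1}).$$

Prior beliefs are given by the three independent distributions

$$\beta \sim N(\underline{\beta}, \underline{H}_\beta^{-1}), \qquad \underline{s}^2 h \sim \chi^2(\underline{\nu}), \qquad \rho \sim \text{uniform}(-1, 1).$$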


(a) Design a Markov chain Monte Carlo algorithm for Bayesian inference in 
this model. 


(b) Now suppose that some of the observables $y_t$ are missing at random (recall
Example 2.2.3). The covariates $x_t$ are always observed. Modify the Gibbs
sampling algorithm designed in part (a) to accommodate this complication.


Exercise 7.1.2 Marginal Likelihood in the Linear Model with Serial Correlation
The prior pdf of $\phi$ is (7.15).


(a) Explain how to approximate $D(\underline{\phi}, \underline{H}_\phi)$ by means of direct simulation.


(b) Why is the value of $D(\underline{\phi}, \underline{H}_\phi)$ important in evaluating the marginal likelihood
of this model?


7.2 THE FIRST-ORDER MARKOV FINITE STATE MODEL 


Often economic agents or entities can be characterized as moving through states 
over time, being in exactly one of m states in each time period. For example, 
an individual might be employed, unemployed, or out of the labor force; or, an 
individual might be married or not married. If the probability of an entity being in 
a particular state in a period depends only on the state occupied by that entity in the 
previous period, the model is a first-order Markov finite state model. These models 
are of interest not only for their own sake but also because they frequently arise 
as important constituents of more complicated models, for example, the Markov 
mixture of normals model discussed in Section 7.3. 

There are two variants of this model. In the nonstationary first-order Markov 
model the probability distribution of agents or entities over states is different from 
one period to the next, but, given weak side conditions presented below, converges 
to a limiting invariant distribution. In the stationary first-order Markov model the 
unconditional probability distributions across states are the same in each period. In 
both variants of this model individual agents move among states and the dynamics 
of this movement are nontrivial and usually a focal point of study. The stationary 
first-order Markov model corresponds more closely to assumptions about behavior 
in many economic applications, and when the first-order Markov model is used as 
a constituent of more complicated models, stationarity may be essential. 

In either variant, the first-order Markov model may be regarded as a generalization
of the independent finite state model. The observables are the same: $s_{kt}$, the
state occupied by entity $k$ at time $t$, collected in the $n \times T$ matrix $S$. As in that
model, the state transitions between time periods $t$ are mutually independent and
identically distributed across agents $k$.
In both the stationary and nonstationary models, for any agent $k = 1, \ldots, n$,
state $j = 1, \ldots, m$ and time period $t = 2, \ldots, T$, we have

$$P[s_{kt} = j \mid s_{k,t-1} = i, s_{ku}\,(u < t - 1), A] = P(s_{kt} = j \mid s_{k,t-1} = i, A) = p_{ij}. \tag{7.18}$$

Let

$$p_j = (p_{j1}, \ldots, p_{jm})' \tag{7.19}$$

and

$$P = [p_{ij}] = [p_1, \ldots, p_m]'. \tag{7.20}$$

In the nonstationary model, the initial period distribution for any agent $k$ $(k =
1, \ldots, n)$ is

$$P(s_{k1} = j \mid A) = \pi_{1j}. \tag{7.21}$$

(This section returns to the stationary model in more detail subsequently.)
Expression (7.18) provides the probability distribution across states for an agent,
conditional on that agent’s history. Not conditioning on this history, denote

$$P(s_{kt} = j \mid A) = \pi_{tj}. \tag{7.22}$$

Then from (7.18), (7.21), and (7.22), $\pi_{tj} = \sum_{i=1}^{m} p_{ij} \pi_{t-1,i}$ $(j = 1, \ldots, m)$; equivalently,
$\pi_t' = \pi_{t-1}' P$, where $\pi_t = (\pi_{t1}, \ldots, \pi_{tm})'$. For any $s < t$, $\pi_t' = \pi_{t-s}' P^s$, and
in particular

$$\pi_t' = \pi_1' P^{t-1}. \tag{7.23}$$


The eigenvalues and eigenvectors of the transition matrix $P$ are important for
the properties of the model. Denote the eigenvalues by $\lambda_1, \ldots, \lambda_m$, ordered so
that $|\lambda_1| \ge \cdots \ge |\lambda_m|$. We shall assume that $P$ is diagonable; that is, it may be
represented $P = C \Lambda C^{-1}$, where the columns of $C$ are right eigenvectors of $P$ and
the rows of $C^{-1}$ are left eigenvectors of $P$. [If the prior distribution of $P$ is absolutely
continuous, as is the case for the prior distribution employed subsequently in
this section, then $P$ is diagonable with probability one. Necessary and sufficient
conditions for a nonsymmetric matrix to be diagonable can be found in many
linear algebra texts, for example, Schott (1997), Section 4.4.] The eigenvalues of
$P$ cannot exceed 1 in modulus, because from (7.23) $0 \le \mathrm{tr}(P^j) \le m$ and $\mathrm{tr}(P^j) =
\sum_{i=1}^{m} \lambda_i^j$. But since $\sum_{j=1}^{m} p_{ij} = 1 \;\forall\; i = 1, \ldots, m$, $P \iota_m = \iota_m$; $\iota_m = (1, \ldots, 1)'$ is a
right eigenvector of $P$ corresponding to an eigenvalue 1, and it is convenient to
take $\lambda_1 = 1$.
A probability distribution over the $m$ states $\pi$ is an invariant distribution if
$\pi' = \pi' P$. The vector $\pi$ must be a left eigenvector corresponding to an eigenvalue
$\lambda = 1$. If $|\lambda_1| > |\lambda_2|$, then this invariant distribution is unique. Suppose instead
that $|\lambda_2| = 1$. If $\lambda_2 = 1$, then the Markov chain is reducible, with invariant states
depending on the initial distribution. Examples are $P = I_m$ and

$$P = \begin{bmatrix} 1 - p_{12} & p_{12} & 0 \\ p_{21} & 1 - p_{21} & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

If $\lambda_2 = -1$ or $|\lambda_2| = 1$ and $\lambda_2$ is complex, then the chain is periodic. Examples
include
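
Two standard transition matrices with these properties, offered as illustrations:

```latex
P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\quad (\lambda_1 = 1,\ \lambda_2 = -1),
\qquad
P = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}
\quad (\lambda_1 = 1,\ \lambda_{2,3} = e^{\pm 2\pi i/3}).
```

The first chain alternates between its two states (period 2); the second cycles through its three states in order (period 3).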


For the prior distribution employed subsequently in this section $|\lambda_2| < 1$ with probability
1. In this case the Markov chain is irreducible and aperiodic, and hence has
a unique invariant distribution. The eigenvalue $\lambda_2$ then provides an upper bound on
the rate of convergence to the invariant distribution, as indicated in the following
result.


Theorem 7.2.1 Convergence in the First-Order Markov Model Suppose that 
the first-order Markov m-state transition matrix P is diagonable with eigenvalues À; 
and |A;| > -+-+ > [àm]. Suppose also that |A2| < 1, and denote the unique invariant 
distribution by x. Then for any r : |A2| < r < 1, lim; sœ r ™ (m, — m) = 0. 


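Proof: Let $P$ have the diagonalization $P = C\Lambda C^{-1}$, $\Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_m)$. Since $\pi_t' = \pi_{t-1}'P = \pi_1'P^{t-1}$ and $\pi' = \pi'P = \pi'P^{t-1}$, it follows that

$$\begin{aligned} \pi_t' - \pi' &= (\pi_1 - \pi)'P^{t-1} = (\pi_1 - \pi)'C\Lambda^{t-1}C^{-1} \\ &= (\pi_1 - \pi)'C\tilde{\Lambda}^{t-1}C^{-1} = \pi_1'C\tilde{\Lambda}^{t-1}C^{-1}, \end{aligned} \tag{7.24}$$

where $\tilde{\Lambda} = \operatorname{diag}(0, \lambda_2,\dots,\lambda_m)$. [The third equality in (7.24) follows because the first column of $C$ is proportional to $\iota_m$ and the last follows because $\pi'$ is proportional to the first row of $C^{-1}$.] Then

$$r^{-t}(\pi_t - \pi)' = r^{-t}\pi_1'C\tilde{\Lambda}^{t-1}C^{-1} = r^{-1}\pi_1'C(r^{-1}\tilde{\Lambda})^{t-1}C^{-1}.$$

Since $\lim_{t\to\infty}(r^{-1}\tilde{\Lambda})^t = 0$, $\lim_{t\to\infty} r^{-t}(\pi_t - \pi) = 0$. □

Note that if all $\lambda_j$ for which $|\lambda_j| = |\lambda_2|$ are real and positive, then, from (7.24), we obtain

$$\begin{aligned} \lim_{t\to\infty} |\lambda_2|^{-t}(\pi_t - \pi)' &= |\lambda_2|^{-1}\pi_1'C \lim_{t\to\infty}\left(|\lambda_2|^{-1}\tilde{\Lambda}\right)^{t-1} C^{-1} \\ &= |\lambda_2|^{-1}\pi_1'CDC^{-1} = v', \end{aligned}$$

where $D = \operatorname{diag}(d_1,\dots,d_m)$ and $d_j = \delta(|\lambda_j|, |\lambda_2|)$. For any positive integer $h$,

$$\lim_{t\to\infty} |\lambda_2|^{-t}(\pi_{t+h} - \pi)' = \lim_{t\to\infty} |\lambda_2|^{h}\,|\lambda_2|^{-(t+h)}(\pi_{t+h} - \pi)' = |\lambda_2|^h v'.$$

If $h = -\log 2/\log|\lambda_2|$, then $|\lambda_2|^h = \tfrac{1}{2}$. This value of $h$ is known as the half-life of the first-order Markov model. (The definition still applies for second-largest roots that are negative or complex, but in that case the limit does not exist as it has been taken here, and the result is in terms of amplitudes of oscillations about the invariant distribution.)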

While the entries $p_{ij}$ of $P$ completely characterize the first-order Markov finite state model, they are not as directly related to the implied dynamics as some functions of these parameters. The invariant distribution $\pi$ and the convergence bound $|\lambda_2|$ are examples. There are also many measures of mobility between states, including the expected length of stay in state $i$, $(1 - p_{ii})^{-1}$, and the overall measure of mobility $[m - \operatorname{tr}(P)]/(m - 1)$. For further discussion and properties of these measures, see Geweke et al. (1986).
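
The functions of $P$ discussed in this section are simple to compute numerically. The following is a minimal sketch, assuming NumPy is available; the function names are illustrative only (they are not taken from the book or from BACC), and the transition matrix is the hypothetical one that appears later in (7.43).

```python
import numpy as np

def invariant_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, scaled to sum to one."""
    eigvals, vecs = np.linalg.eig(P.T)      # right eigenvectors of P' are left eigenvectors of P
    k = np.argmin(np.abs(eigvals - 1.0))
    c = np.real(vecs[:, k])
    return c / c.sum()

def second_modulus(P):
    """|lambda_2|, the second-largest eigenvalue modulus of P."""
    moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return moduli[1]

def half_life(P):
    """h = -log 2 / log|lambda_2|."""
    return -np.log(2.0) / np.log(second_modulus(P))

def expected_stay(P):
    """Expected length of stay in each state, 1 / (1 - p_ii)."""
    return 1.0 / (1.0 - np.diag(P))

def mobility(P):
    """Overall mobility measure [m - tr(P)] / (m - 1)."""
    m = P.shape[0]
    return (m - np.trace(P)) / (m - 1)

# Hypothetical transition matrix, the same values as in (7.43).
P = np.array([[0.95, 0.03, 0.02],
              [0.10, 0.54, 0.36],
              [0.05, 0.57, 0.38]])
pi = invariant_distribution(P)
print(pi, second_modulus(P), half_life(P), expected_stay(P), mobility(P))
```

The eigenvector route assumes that $P$ is irreducible and aperiodic, so that the eigenvalue 1 is simple; for a reducible chain the selected eigenvector is not unique and the result should not be interpreted as the invariant distribution.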


7.2.1 Inference in the Nonstationary Model 


Recall that the observables are collected in $S = [s_{kt}]$, where $s_{kt}$ is the state occupied by entity $k$ at time $t$. From (7.18) and (7.21), we obtain

$$p(S \mid \pi_1, P, A) = \prod_{k=1}^n \left( \pi_{1,s_{k1}} \prod_{t=2}^T p_{s_{k,t-1},\,s_{kt}} \right). \tag{7.25}$$


Let $n_j = \sum_{k=1}^n \delta(s_{k1}, j)$ denote the number of agents in state $j$ at $t = 1$. Let $n_{ij} = \sum_{t=2}^T \sum_{k=1}^n \delta(s_{k,t-1}, i)\,\delta(s_{kt}, j)$ denote the number of observable transitions from state $i$ in one period to state $j$ in the next period. Then (7.25) may be expressed

$$p(S \mid \pi_1, P, A) = \prod_{j=1}^m \pi_{1j}^{n_j} \prod_{i=1}^m \prod_{j=1}^m p_{ij}^{n_{ij}}. \tag{7.26}$$

Observe that (7.26) is the product of $m + 1$ components:

$$p(S \mid \pi_1, P, A) = \left[\prod_{j=1}^m \pi_{1j}^{n_j}\right] \left[\prod_{j=1}^m p_{1j}^{n_{1j}}\right] \cdots \left[\prod_{j=1}^m p_{mj}^{n_{mj}}\right]. \tag{7.27}$$

Each of these terms has the same functional form as (6.14). Formally, there are
m + 1 independent finite state models in (7.25) and (7.27): one for the first period, 
and one for each of the m states on which the transition probabilities are condi- 
tioned. Just as in (6.14), the probabilities are nonnegative and must sum to one in 
each model. 

Since the likelihood function in (7.27) has m + 1 factors, the conjugate prior 
distribution will have m + 1 corresponding independent components. Moreover, 
these conjugate prior distributions will all be Dirichlet, as shown in Section 6.3. 
Thus the conjugate prior density is 


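$$p(\pi_1 \mid A) = \frac{\Gamma\left(\sum_{j=1}^m a_j\right)}{\prod_{j=1}^m \Gamma(a_j)} \prod_{j=1}^m \pi_{1j}^{a_j - 1}, \tag{7.28}$$

$$p(P \mid A) = \frac{\prod_{i=1}^m \Gamma\left(\sum_{j=1}^m a_{ij}\right)}{\prod_{i=1}^m \prod_{j=1}^m \Gamma(a_{ij})} \prod_{i=1}^m \prod_{j=1}^m p_{ij}^{a_{ij} - 1}, \tag{7.29}$$

where $a_i > 0$ $(i = 1,\dots,m)$ and $a_{ij} > 0$ $(i, j = 1,\dots,m)$. The support is the Cartesian product of $m + 1$ $(m - 1)$-dimensional unit simplexes, one each for $\pi_1$ and the rows $p_1,\dots,p_m$ of $P$ [recall (7.19)–(7.20)].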

It follows immediately from (7.28)–(7.29) and (7.27) that in the posterior distribution the $m \times 1$ vectors $\pi_1, p_1, \dots, p_m$ are mutually independent, each with a Dirichlet distribution:

$$p(\pi_1, P \mid S^o, A) \propto \prod_{j=1}^m \pi_{1j}^{a_j + n_j^o - 1} \prod_{i=1}^m \prod_{j=1}^m p_{ij}^{a_{ij} + n_{ij}^o - 1}. \tag{7.30}$$

The Dirichlet posterior distribution of $\pi_1$ has parameters $a_j + n_j^o$ $(j = 1,\dots,m)$ and the posterior distribution of $p_i$ $(i = 1,\dots,m)$ has parameters $a_{ij} + n_{ij}^o$ $(j = 1,\dots,m)$. Derivation of a closed-form expression for the marginal likelihood of the model is straightforward and is left to Exercise 7.2.4.
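
Posterior simulation in the nonstationary model therefore reduces to counting and drawing from Dirichlet distributions. The following is a minimal sketch under stated assumptions: states are coded $0,\dots,m-1$ rather than $1,\dots,m$, the panel `S` is randomly generated purely for illustration, and the function names are hypothetical rather than part of BACC.

```python
import numpy as np

def sufficient_counts(S, m):
    """S is an (n, T) integer array of states coded 0..m-1.
    Returns the initial-state counts n_j and the transition counts n_ij."""
    n_init = np.bincount(S[:, 0], minlength=m)
    n_trans = np.zeros((m, m), dtype=int)
    # Pair each state with its successor and accumulate the (i, j) counts.
    np.add.at(n_trans, (S[:, :-1].ravel(), S[:, 1:].ravel()), 1)
    return n_init, n_trans

def draw_posterior(n_init, n_trans, a_init, a_trans, rng):
    """One draw of (pi_1, P) from the independent Dirichlet posteriors in (7.30)."""
    pi1 = rng.dirichlet(a_init + n_init)
    P = np.vstack([rng.dirichlet(a_trans[i] + n_trans[i])
                   for i in range(len(a_init))])
    return pi1, P

rng = np.random.default_rng(0)
m, n, T = 3, 50, 20
S = rng.integers(0, m, size=(n, T))        # illustrative data only
n_init, n_trans = sufficient_counts(S, m)
pi1, P = draw_posterior(n_init, n_trans, np.ones(m), np.ones((m, m)), rng)
```

Because the $m + 1$ components are independent in the posterior, each call to `draw_posterior` is an independent draw and no burn-in is needed.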


7.2.2 Inference in the Stationary Model 


In the stationary model, $\pi_t = \pi$ for all time periods $t$, equivalent to the restriction $\pi_1 = \pi$. If the transition matrix $P$ is irreducible and aperiodic, then $|\lambda_2| < 1$ and there is a $1 \times m$ left eigenvector $c^1$, unique up to an arbitrary scale factor, with the property $c^1P = c^1$. Computation of $c^1$ given $P$ is standard, and $\pi'$ is $c^1$ normalized so that its elements sum to one. (The elements of $c^1$ will all be nonnegative; see Exercise 7.2.1. Exercise 7.2.2 provides an alternative method for finding $\pi$ given $P$.) Denote this mapping $\pi(P) = [\pi_1(P),\dots,\pi_m(P)]'$. As long as the rows of $P$ have absolutely continuous distributions on the unit simplex, as is the case for the Dirichlet distribution, then, with probability one, $|\lambda_2| < 1$ and $\pi$ may be computed in this way.

Given the stationarity restriction, the likelihood function is

$$p(S \mid P, A) = \left( \prod_{j=1}^m \pi_j(P)^{n_j} \right) \left( \prod_{i=1}^m \prod_{j=1}^m p_{ij}^{n_{ij}} \right).$$

Retaining the prior density (7.29) for $P$, the posterior density kernel is

$$p(P \mid S^o, A) \propto \prod_{j=1}^m \pi_j(P)^{n_j^o} \prod_{i=1}^m \prod_{j=1}^m p_{ij}^{a_{ij} + n_{ij}^o - 1}. \tag{7.31}$$

The kernel (7.31) does not correspond to any standard distribution function, but its second component is the product of Dirichlet probability density functions for $p_1,\dots,p_m$, while its first component is bounded above. This suggests three closely related methods of sampling from the posterior distribution.


1. Importance Sampling. Draw $p_i$ from a Dirichlet distribution with parameters $a_{i1} + n_{i1}^o, \dots, a_{im} + n_{im}^o$ $(i = 1,\dots,m)$. The weight associated with the draw is $\prod_{j=1}^m \pi_j(P)^{n_j^o}$.
2. Acceptance Sampling. The largest possible value of the first component on the right side of (7.31) is $\prod_{j=1}^m \hat{\pi}_j^{n_j^o}$, where $\hat{\pi}_j = n_j^o / \sum_{i=1}^m n_i^o$. The source density is the same as the importance sampling density, and the acceptance probability is $\prod_{j=1}^m (\pi_j(P)/\hat{\pi}_j)^{n_j^o}$.
3. An Independence Metropolis-Hastings Algorithm. The candidate $P^*$ is drawn from the same set of independent Dirichlet distributions used for the draws in importance and acceptance sampling, and the acceptance probability is

$$\min\left\{ \prod_{j=1}^m \left[ \pi_j(P^*)/\pi_j(P) \right]^{n_j^o},\ 1 \right\},$$

where $P$ is the value in the previous iteration.
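
The third method can be sketched in a few lines. This is an illustration under stated assumptions, not the BACC implementation: NumPy is assumed, the function names are hypothetical, and the invariant distribution is computed from the leading left eigenvector, which presumes every candidate is irreducible and aperiodic (true with probability one for Dirichlet draws).

```python
import numpy as np

def invariant_distribution(P):
    """Leading left eigenvector of P, scaled to sum to one."""
    vals, vecs = np.linalg.eig(P.T)
    c = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return c / c.sum()

def log_first_component(P, n_init):
    """log of prod_j pi_j(P)^{n_j}, the non-Dirichlet factor in (7.31)."""
    return float(n_init @ np.log(invariant_distribution(P)))

def independence_mh(n_init, n_trans, a_trans, n_draws, rng):
    """Independence Metropolis-Hastings draws of P for the stationary model."""
    m = len(n_init)

    def candidate():
        # Rows are independent Dirichlet(a_i + n_i) draws.
        return np.vstack([rng.dirichlet(a_trans[i] + n_trans[i]) for i in range(m)])

    P = candidate()
    log_w = log_first_component(P, n_init)
    draws, accepted = [], 0
    for _ in range(n_draws):
        P_star = candidate()
        log_w_star = log_first_component(P_star, n_init)
        # Accept with probability min{exp(log_w_star - log_w), 1}.
        if np.log(rng.uniform()) < min(0.0, log_w_star - log_w):
            P, log_w = P_star, log_w_star
            accepted += 1
        draws.append(P)
    return np.array(draws), accepted / n_draws
```

The quantity returned by `log_first_component` is also the log importance weight of method 1, so the same candidate draws could be reused for importance sampling. Working on the log scale avoids underflow when the counts $n_j^o$ are large.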


The efficiency of these algorithms will depend on how close the observed distri- 
bution of entities across states at t = 1 is to the invariant distribution corresponding 
to transition matrices P that are probable given the subsequent state-to-state tran- 
sitions. Loosely speaking, the metric for measuring “close” is the probability ratio 
of the $t = 1$ outcome under the nonstationary model specification to that under the
stationary model specification. For an algorithm that can be more efficient than 
any of these, see Exercise 7.2.3. 

BACC incorporates both the stationary and nonstationary first-order Markov
finite state models using the conjugate prior distributions and posterior simulation
algorithms described in this section.

Exercise 7.2.1 Properties of the Leading Left Eigenvector of P Suppose that $P = C\Lambda C^{-1}$ is irreducible and aperiodic.


(a) Show that $\lim_{t\to\infty} P^t = c_1 c^1$, where $c_1 \propto \iota_m$ is the first column of $C$ (first right eigenvector of $P$) and $c^1$ is the first row of $C^{-1}$ (first left eigenvector of $P$).


(b) Show that the elements of $c^1$ must be nonnegative.


Exercise 7.2.2 Computation of the Invariant Distribution $\pi$ This exercise develops a method of obtaining $\pi$ from $P$ that avoids computation of the eigenvectors of $P$.


(a) Show that when $P$ is irreducible and aperiodic, $\pi$ is the unique solution of the system of $m + 1$ linear equations in $m$ unknowns $A\pi = b$, where

$$A = \begin{bmatrix} I_m - P' \\ \iota_m' \end{bmatrix}, \qquad b = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$

(b) From (a), deduce $\pi = (A'A)^{-1}A'b$.
(c) Show that $\pi$ is the sum of the columns of $(A'A)^{-1}$.
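
A numerical check of the claims in this exercise (not a proof) can be sketched as follows. The sketch assumes NumPy and takes $A$ to be $I_m - P'$ stacked on a row of ones, with $b$ equal to zero except for a final 1; the function name is illustrative.

```python
import numpy as np

def invariant_by_least_squares(P):
    """Return pi computed as (A'A)^{-1} A'b and as the sum of the columns of (A'A)^{-1}."""
    m = P.shape[0]
    A = np.vstack([np.eye(m) - P.T, np.ones((1, m))])   # (m + 1) x m
    b = np.concatenate([np.zeros(m), [1.0]])
    AtA_inv = np.linalg.inv(A.T @ A)
    pi_ls = AtA_inv @ A.T @ b
    pi_cols = AtA_inv.sum(axis=1)                       # sum of the columns
    return pi_ls, pi_cols

P = np.array([[0.95, 0.03, 0.02],
              [0.10, 0.54, 0.36],
              [0.05, 0.57, 0.38]])
pi_ls, pi_cols = invariant_by_least_squares(P)
```

In practice a least-squares solver such as `np.linalg.lstsq(A, b, rcond=None)` would usually be preferred to forming $(A'A)^{-1}$ explicitly; the explicit inverse is used here only because it mirrors parts (b) and (c).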


Exercise 7.2.3 An Alternative Posterior Simulator Consider the following MCMC algorithm for the stationary first-order Markov finite state model. At each step $s$, there are $m$ substeps. Let $p_j^{(s)}$ denote the $j$th row of $P$ at the end of step $s$. At substep $j$ of step $s$, define

$$P^{(s,j)} = \left[ p_1^{(s)}, \dots, p_{j-1}^{(s)}, p_j^{(s-1)}, \dots, p_m^{(s-1)} \right]'.$$

At substep $j$ of step $s$, draw a candidate $p_j^*$ from a Dirichlet distribution with parameters $a_{j1} + n_{j1}^o, \dots, a_{jm} + n_{jm}^o$ and define

$$P^{*(s,j)} = \left[ p_1^{(s)}, \dots, p_{j-1}^{(s)}, p_j^*, p_{j+1}^{(s-1)}, \dots, p_m^{(s-1)} \right]'.$$

Set $p_j^{(s)} = p_j^*$ with probability

$$\min\left\{ \prod_{i=1}^m \left[ \pi_i(P^{*(s,j)})/\pi_i(P^{(s,j)}) \right]^{n_i^o},\ 1 \right\}$$

and otherwise set $p_j^{(s)} = p_j^{(s-1)}$.


(a) Show that the invariant distribution of this algorithm is the posterior distribution in the stationary first-order Markov finite state model.

(b) Indicate why this algorithm might be more efficient than the independence Metropolis-Hastings algorithm described in this section.

Exercise 7.2.4 Provide a closed-form expression for the marginal likelihood in the nonstationary first-order Markov model with prior density (7.28)–(7.29), and likelihood function given by (7.27), with the random $n_j$ and $n_{ij}$ replaced by the corresponding observed $n_j^o$ and $n_{ij}^o$. [Hint: Review the derivation of (6.18) from (6.17).]


7.3 MARKOV NORMAL MIXTURE LINEAR MODEL 


Section 6.4.2 introduced normal mixture linear models (6.34)–(6.35) to accommodate a nonnormal disturbance term in the linear model. In that model the latent states $\tilde{s}_t$ are i.i.d., and conditional on each state the disturbance is normally distributed. That model is attractive because it can approximate i.i.d. disturbances with absolutely continuous distributions very well, by incorporating a sufficiently large number of states, while at the same time the posterior distribution can always be blocked into three components for Gibbs sampling using the elementary normal, gamma, and independent finite state distributions.

Example 6.4.1 applied the normal mixture linear model in a situation in which 
the assumption of normality was easily overturned. In many time series applica- 
tions, however, the specification that disturbances are i.i.d. is undesirable. This is 
particularly so in the case of asset return modeling, introduced in Section 1.1.2. 
Not only are the sample moments of financial returns strongly inconsistent with 
a normal sampling distribution; these moments also appear to evolve slowly with 
time. For example, if $y_t$ is the return on a financial asset then sample correlations of $y_t^2$ and $y_{t-j}^2$ are typically positive for small values of $j$, whereas the corresponding population correlations would be zero if $\{y_t\}$ were i.i.d.

It is straightforward to generalize the normal mixture linear model to permit this behavior, by substituting the stationary first-order Markov finite state model of Section 7.2 with a single cross section ($n = 1$) for the independent finite state model of the latent state vector $\tilde{s} = (\tilde{s}_1,\dots,\tilde{s}_T)'$ used in Section 6.4.2. This idea dates at least to Lindgren (1978); early Bayesian treatments include Albert and Chib (1993a), McCulloch and Tsay (1994), and Chib (1996). In lieu of (6.34), from (7.18) we then have

$$P[\tilde{s}_t = j \mid \tilde{s}_{t-1} = i, \tilde{s}_u\ (u < t-1), A] = P(\tilde{s}_t = j \mid \tilde{s}_{t-1} = i, A) = p_{ij}.$$

The type of behavior just described for financial asset returns would be exhibited if, for example, $p_{ii} > \sum_{j \ne i} p_{ij}$ for at least some states $i$, while the precisions $h \cdot h_i$ differ substantially across those same states. From the parameters $[p_{ij}]$ define the Markov transition matrix $P$ as in (7.20). Conditional on $\tilde{s}$, the Markov normal mixture linear model is exactly the same as the normal mixture linear model. In particular, (6.35) provides the conditional pdf of $y_t$, and the conditionally conjugate prior densities of $\beta$, $h$, $\mathbf{h}$, and $\alpha$ continue to be (6.31), (6.32), (6.37), and (6.38), respectively. From Section 7.2, the rows of the transition matrix $P$ have independent Dirichlet distributions in the conditionally conjugate prior:


$$p(p_{i1},\dots,p_{ii},\dots,p_{im} \mid A) \propto p_{ii}^{r_1 - 1} \prod_{j \ne i} p_{ij}^{r_2 - 1}. \tag{7.32}$$

The distinction between $r_1$ and $r_2$ allows the prior to specify more and less plausible degrees of persistence, while retaining the interchangeability of the $m$ states. These states remain interchangeable in the posterior distribution, as well, and the remarks about this same feature in the normal mixture model in Section 6.4.2 apply here also.

The conditional posterior distributions of $\gamma' = (\alpha', \beta')$, $h$, and $\mathbf{h}$ continue to be (6.41), (6.42), and (6.43), respectively, exactly as in Section 6.4.2. In particular, because $\tilde{s}$ is present in all of these conditional distributions, $\pi$ was absent in Section 6.4.2 and $P$ is absent here. In Section 6.4.2 the conditional posterior distribution of $p$ was Dirichlet. Here, the conditional posterior distribution of each row of $P$ is Dirichlet, (7.31), but with the terms $n_j^o$ and $n_{ij}^o$ referring to the latent states on which the distribution is conditioned. Since there is a single time series, $n_j = \delta(\tilde{s}_1, j)$, and $\sum_{i=1}^m \sum_{j=1}^m n_{ij} = T - 1$. The principal new complication for posterior simulation is the conditional posterior distribution of the latent states $\tilde{s}$. In the normal mixture linear model the states were conditionally independent, leading to the sequence of $T$ independent finite state distributions with probabilities (6.44). In the Markov normal mixture linear model the conditional posterior kernel is

$$\begin{aligned} p(\tilde{s} \mid y, h, P, \mathbf{h}, \gamma, X, A) \propto {} & \pi_{\tilde{s}_1} h_{\tilde{s}_1}^{1/2} \exp[-h \cdot h_{\tilde{s}_1}(y_1 - \alpha_{\tilde{s}_1} - \beta'x_1)^2/2] \\ & \cdot \prod_{t=2}^T p_{\tilde{s}_{t-1},\tilde{s}_t}\, h_{\tilde{s}_t}^{1/2} \exp[-h \cdot h_{\tilde{s}_t}(y_t - \alpha_{\tilde{s}_t} - \beta'x_t)^2/2] \end{aligned} \tag{7.33}$$

and thus

$$\begin{aligned} p(\tilde{s}_t \mid \tilde{s}_u\ (u \ne t), y, h, P, \mathbf{h}, \gamma, X, A) \propto {} & p_{\tilde{s}_{t-1},\tilde{s}_t}\, p_{\tilde{s}_t,\tilde{s}_{t+1}} \\ & \cdot h_{\tilde{s}_t}^{1/2} \exp[-h \cdot h_{\tilde{s}_t}(y_t - \alpha_{\tilde{s}_t} - \beta'x_t)^2/2] \end{aligned} \tag{7.34}$$

for $t = 2,\dots,T-1$. (Expressions for $t = 1$ and $t = T$ are slightly modified.) Draws for $\tilde{s}_t$ could be made successively from (7.34), but that algorithm induces substantial serial correlation if, as is typically the case, $p_{ii} \gg \sum_{j \ne i} p_{ij}$ for at least some states $i$.

A more efficient algorithm due to Chib (1996) draws $\tilde{s}$ directly from (7.34), and yields several important functions of interest as byproducts. Consistent with our definition of $Y_t$ in Section 2.1, let $\tilde{S}_t = (\tilde{s}_1,\dots,\tilde{s}_t)'$, further define $Y^t = (y_t,\dots,y_T)'$ and $\tilde{S}^t = (\tilde{s}_t,\dots,\tilde{s}_T)'$, and extend the convention $Y_0 = \{\emptyset\}$ to $\tilde{S}_0 = Y^{T+1} = \tilde{S}^{T+1} = \{\emptyset\}$. To render the notation more compact as well as to emphasize the fact that this algorithm applies to first-order Markov mixtures of distributions generally, let $\theta_{A1}' = (\gamma', h, \mathbf{h}')$ and denote

$$p(y_t \mid \tilde{s}_t = j, \theta_{A1}, A) \propto h_j^{1/2} \exp[-h \cdot h_j (y_t - \alpha_j - \beta'x_t)^2/2]. \tag{7.35}$$


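We may write and decompose (7.34) in the form

$$p(\tilde{S}_T \mid Y_T^o, \theta_{A1}, P, A) = \prod_{t=1}^T p(\tilde{s}_t \mid Y_T^o, \tilde{S}^{t+1}, \theta_{A1}, P, A). \tag{7.36}$$

For each of the $T$ terms on the right side of (7.36), we obtain

$$\begin{aligned}
p(\tilde{s}_t \mid Y_T^o, \tilde{S}^{t+1}, \theta_{A1}, P, A) &\propto p(\tilde{s}_t, Y_T^o, \tilde{S}^{t+1} \mid \theta_{A1}, P, A) \\
&= p(\tilde{s}_t, Y_t^o, Y^{t+1}, \tilde{S}^{t+1} \mid \theta_{A1}, P, A) \\
&= p(Y^{t+1}, \tilde{S}^{t+1} \mid \tilde{s}_t, Y_t^o, \theta_{A1}, P, A)\, p(\tilde{s}_t, Y_t^o \mid \theta_{A1}, P, A) \\
&\propto p(Y^{t+1}, \tilde{S}^{t+1} \mid \tilde{s}_t, Y_t^o, \theta_{A1}, P, A)\, p(\tilde{s}_t \mid Y_t^o, \theta_{A1}, P, A) \\
&= p(Y^{t+1}, \tilde{S}^{t+2} \mid \tilde{s}_t, \tilde{s}_{t+1}, Y_t^o, \theta_{A1}, P, A) \\
&\quad \cdot p(\tilde{s}_{t+1} \mid \tilde{s}_t, Y_t^o, \theta_{A1}, P, A)\, p(\tilde{s}_t \mid Y_t^o, \theta_{A1}, P, A) \\
&= p(Y^{t+1}, \tilde{S}^{t+2} \mid \tilde{s}_{t+1}, Y_t^o, \theta_{A1}, P, A) \\
&\quad \cdot p(\tilde{s}_{t+1} \mid \tilde{s}_t, P, A)\, p(\tilde{s}_t \mid Y_t^o, \theta_{A1}, P, A) \\
&\propto p(\tilde{s}_{t+1} \mid \tilde{s}_t, P, A)\, p(\tilde{s}_t \mid Y_t^o, \theta_{A1}, P, A).
\end{aligned} \tag{7.37}$$

We exploit this decomposition to simulate $\tilde{S}_T$. The first of the two terms in (7.37) is simply $p(\tilde{s}_{t+1} = j \mid \tilde{s}_t = i, P, A) = p_{ij}$. The second term may be evaluated in a forward recursion beginning with

$$p(\tilde{s}_1 = j \mid Y_1^o, \theta_{A1}, P, A) \propto p(\tilde{s}_1 = j \mid P, A)\, p(y_1^o \mid \tilde{s}_1 = j, \theta_{A1}, A).$$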


Recall that $p(\tilde{s}_1 = j \mid P, A)$ is the unconditional state $j$ probability $\pi_j(P)$, defined in Section 7.2, and (7.35) provides $p(y_1^o \mid \tilde{s}_1 = j, \theta_{A1}, P, A)$. Note also that

$$p(y_1^o \mid \theta_{A1}, P, A) = \sum_{j=1}^m \pi_j(P)\, p(y_1^o \mid \tilde{s}_1 = j, \theta_{A1}, A). \tag{7.38}$$

Step $t$ of the recursion has two substeps. In the prediction step

$$p(\tilde{s}_t = j \mid Y_{t-1}^o, \theta_{A1}, P, A) = \sum_{i=1}^m p_{ij} \cdot p(\tilde{s}_{t-1} = i \mid Y_{t-1}^o, \theta_{A1}, P, A). \tag{7.39}$$

The name derives from the fact that, as a byproduct, we can produce the one-step-ahead predictive conditional density

$$p(y_t \mid Y_{t-1}^o, \theta_{A1}, P, A) = \sum_{j=1}^m p(\tilde{s}_t = j \mid Y_{t-1}^o, \theta_{A1}, P, A) \cdot p(y_t \mid Y_{t-1}^o, \tilde{s}_t = j, \theta_{A1}, A). \tag{7.40}$$

In the update step

$$p(\tilde{s}_t = j \mid Y_t^o, \theta_{A1}, P, A) \propto p(\tilde{s}_t = j \mid Y_{t-1}^o, \theta_{A1}, P, A) \cdot p(y_t^o \mid Y_{t-1}^o, \tilde{s}_t = j, \theta_{A1}, A). \tag{7.41}$$

The name derives from the fact that this step updates the conditional time $t$ state probabilities produced at time $t - 1$ in (7.39), producing the filtered probabilities in (7.41), so called because they are a function of past values of the observables. Substituting the observed $y_t^o$ for $y_t$ in (7.40) provides the $t$th component of the likelihood function

$$p(y_t^o \mid Y_{t-1}^o, \theta_{A1}, P, A) = \sum_{j=1}^m p(\tilde{s}_t = j \mid Y_{t-1}^o, \theta_{A1}, P, A) \cdot p(y_t^o \mid Y_{t-1}^o, \tilde{s}_t = j, \theta_{A1}, A), \tag{7.42}$$

and at the end of the recursion $p(y^o \mid \theta_{A1}, P, A)$ is provided by the product of (7.38) and (7.42) evaluated for $t = 2,\dots,T$.

Drawing from $p(\tilde{S}_T \mid Y_T^o, \theta_{A1}, P, A)$ is now straightforward. The last update step (7.41) provides $p(\tilde{s}_T = j \mid Y_T^o, \theta_{A1}, P, A)$, an $m$-state distribution. Then successive evaluation of (7.37) for $t = T - 1,\dots,1$ provides the finite state distributions for the other time periods. These distributions provide the smoothed probabilities for the states, so called because they take into account observations made after the occurrence of each latent state as well as before.
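
The forward recursion (7.38)-(7.42) and the backward draw based on (7.37) can be sketched as follows. This is a minimal illustration, not the code from the online appendix: NumPy is assumed, states are coded $0,\dots,m-1$, the function names are hypothetical, and the state-conditional densities are passed in as a $T \times m$ array `lik` so that the same routine applies to any first-order Markov mixture.

```python
import numpy as np

def forward_filter(P, pi0, lik):
    """lik[t, j] = p(y_t | s_t = j). Returns filtered probabilities (T, m)
    and the log-likelihood accumulated from (7.38) and (7.42)."""
    T, m = lik.shape
    filt = np.empty((T, m))
    loglik = 0.0
    pred = pi0                                   # p(s_1 = j | P)
    for t in range(T):
        if t > 0:
            pred = filt[t - 1] @ P               # prediction step (7.39)
        joint = pred * lik[t]
        norm = joint.sum()                       # (7.38) at t = 1, (7.42) afterwards
        filt[t] = joint / norm                   # update step (7.41)
        loglik += np.log(norm)
    return filt, loglik

def backward_sample(P, filt, rng):
    """Draw the whole state path, last period first, using (7.37)."""
    T, m = filt.shape
    s = np.empty(T, dtype=int)
    s[-1] = rng.choice(m, p=filt[-1])
    for t in range(T - 2, -1, -1):
        w = P[:, s[t + 1]] * filt[t]             # p_{i, s_{t+1}} * p(s_t = i | Y_t)
        s[t] = rng.choice(m, p=w / w.sum())
    return s

def normal_lik(y, alpha, prec):
    """Normal densities with means alpha_j and precisions prec_j, shape (T, m)."""
    return np.sqrt(prec / (2 * np.pi)) * np.exp(-0.5 * prec * (y[:, None] - alpha) ** 2)
```

For long series or sharply peaked densities the rows of `lik` can underflow; a common remedy is to carry log densities and subtract the row maximum before exponentiating.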


Example 7.3.1 Filtering and Smoothing in the Markov Normal Mixture Linear Model (The online appendix contains annotated code and output for this example.) To appreciate some of the properties of this model, consider a hypothetical simple case in which there are $m = 3$ components, a single covariate $x_t = 1$, and known parameter values $\beta = 0$, $h = 1$:

$$P = \begin{bmatrix} 0.95 & 0.03 & 0.02 \\ 0.10 & 0.54 & 0.36 \\ 0.05 & 0.57 & 0.38 \end{bmatrix}, \quad \alpha = \begin{bmatrix} 0 \\ 2 \\ -3 \end{bmatrix}, \quad \mathbf{h} = \begin{bmatrix} 1 \\ 0.25 \\ 0.111 \end{bmatrix}. \tag{7.43}$$

The invariant distribution corresponding to $P$ is $\pi = (0.6154, 0.2308, 0.1538)'$. For these parameter values $E(y_t \mid \tilde{s}_{t-1} = j, A) = 0$ $(j = 1, 2, 3)$, and consequently $E(y_t \mid A) = 0$ and $E(y_t \mid Y_{t-1}, A) = 0$. For all $i$, $p_{i2}/p_{i3} = 1.5$, and consequently $p(y_{t+1} \mid \tilde{s}_t = 2, A) \approx p(y_{t+1} \mid \tilde{s}_t = 3, A)$. State 1 characterizes periods of low volatility; states 2 and 3, periods of high volatility. In a period of high volatility, a return to a period of low volatility is twice as likely in state 2 (when $y_t$ is usually positive) as it is in state 3 (when $y_t$ is usually negative).

Although parameter values are known, the states are unobserved. Hence at any time $T$ there is uncertainty about $\tilde{s}_t$ $(t \le T)$. Consider the situation portrayed in Figure 7.1, in which $T = 200$. Panel (a) shows $y_t^o$. The squares of these values, in panel (b), indicate clearly that there are alternating periods of low and high volatility. At time $t$, the $r$-step-ahead predictive density is

$$\begin{aligned} p(y_{t+r} \mid Y_t^o, A) = {} & \sum_{i=1}^m p(\tilde{s}_t = i \mid Y_t^o, A) && (7.44) \\ & \cdot \sum_{j=1}^m P(\tilde{s}_{t+r} = j \mid \tilde{s}_t = i, A)\, p(y_{t+r} \mid \tilde{s}_{t+r} = j, A), && (7.45) \end{aligned}$$

where $p(y_{t+r} \mid \tilde{s}_{t+r} = j, A)$ is the normal density with mean $\alpha_j$ and precision $h \cdot h_j$, $P(\tilde{s}_{t+r} = j \mid \tilde{s}_t = i, A)$ is the element in row $i$ and column $j$ of $P^r$, and $p(\tilde{s}_t = i \mid Y_t^o, A)$ is given by (7.41). The latter filtered probabilities are shown in panels (c), (e), and (g) of Figure 7.1. In some periods $t$ the value of $\tilde{s}_t$ is nearly certain. This is especially so when $|y_t^o|$ is large, exceeding about 2. In several periods, however, there is substantial uncertainty. This uncertainty is reflected in predictive densities (7.45), especially for $r = 1$. [As $r \to \infty$, $P(\tilde{s}_{t+r} = j \mid \tilde{s}_t = i) \to \pi_j$ for all $i$.]
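
The $r$-step-ahead predictive density is a finite mixture of normals whose weights are the filtered probabilities propagated through $P^r$. A minimal sketch, assuming NumPy and the parameter values in (7.43) with $h = 1$; the function names are illustrative, and the filtered probabilities used are those reported for $t = 68$ in the caption of Figure 7.2.

```python
import numpy as np

P = np.array([[0.95, 0.03, 0.02],
              [0.10, 0.54, 0.36],
              [0.05, 0.57, 0.38]])
alpha = np.array([0.0, 2.0, -3.0])
prec = np.array([1.0, 0.25, 0.111])            # h * h_j with h = 1

def state_probabilities(filt, r, P):
    """p(s_{t+r} = j | Y_t): filtered probabilities times the r-th power of P."""
    return filt @ np.linalg.matrix_power(P, r)

def predictive_density(y, filt, r, P, alpha, prec):
    """p(y_{t+r} | Y_t) evaluated on the grid y."""
    weights = state_probabilities(filt, r, P)
    dens = np.sqrt(prec / (2 * np.pi)) * np.exp(-0.5 * prec * (y[:, None] - alpha) ** 2)
    return dens @ weights

filt = np.array([0.357, 0.544, 0.099])         # filtered probabilities at t = 68 (Figure 7.2)
y = np.arange(-40.0, 40.0, 0.01)
d1 = predictive_density(y, filt, 1, P, alpha, prec)
```

Evaluating `state_probabilities` for increasing `r` shows the weights settling at the invariant distribution, which is why predictive densities at long horizons no longer depend on the filtered probabilities.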

As time passes, much of the uncertainty about $\tilde{s}_t$ is resolved by means of
conditioning on future $y_{t+j}^o$ as well as past $y_{t-j}^o$ $(j > 0)$. The smoothing filter
(7.37) provides the conditional probabilities, displayed in panels (d), (f), and (h) of 
Figure 7.1. Whereas there are 118, out of 600, filtered probabilities between 0.10 
and 0.90 in Figure 7.1, there are only 72 smoothed probabilities in this range. The 
smoothed probabilities may matter for inference about the past in some applications, 
but they are irrelevant for prediction. 
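
Smoothed marginal probabilities can be computed from the filtered ones by a backward pass. The sketch below assumes NumPy and uses the standard marginal smoothing recursion obtained by averaging (7.37) over the smoothed distribution of the next state; it is offered as an illustration, not as the code from the online appendix, and the function name is hypothetical.

```python
import numpy as np

def filter_and_smooth(P, pi0, lik):
    """lik[t, j] = p(y_t | s_t = j). Returns (filtered, smoothed), each of shape (T, m)."""
    T, m = lik.shape
    filt = np.empty((T, m))
    pred = np.empty((T, m))
    pred[0] = pi0
    for t in range(T):
        if t > 0:
            pred[t] = filt[t - 1] @ P            # one-step-ahead state probabilities
        joint = pred[t] * lik[t]
        filt[t] = joint / joint.sum()            # filtered probabilities
    smooth = np.empty((T, m))
    smooth[-1] = filt[-1]                        # last period: smoothed equals filtered
    for t in range(T - 2, -1, -1):
        # p(s_t = i | Y_T) = filt_t[i] * sum_j p_ij * smooth_{t+1}[j] / pred_{t+1}[j]
        smooth[t] = filt[t] * (P @ (smooth[t + 1] / pred[t + 1]))
    return filt, smooth
```

Counting the entries of `filt` and `smooth` that lie between 0.10 and 0.90 gives the kind of comparison reported in this example.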

Panel (a) of Figure 7.2 shows the unconditional density

$$p(y_t \mid A) = \sum_{j=1}^m \pi_j\, p(y_t \mid \tilde{s}_t = j, A),$$

which clearly reflects the negative skewness coefficient ($-0.7166$) and the excess kurtosis (1.4533) of the unconditional distribution. The predictive distribution $p(y_{t+1} \mid Y_t^o, A)$ varies considerably, depending on the filtered state probabilities $p(\tilde{s}_t = i \mid Y_t^o, A)$, by means of (7.45). The remaining panels of Figure 7.2 provide some examples. In period $t = 15$, the filtered probability of state 1 is nearly 1; $y_{14}^o = -0.5056$ and $y_{15}^o = 0.0302$. The predictive density is nearly identical with the first normal distribution in (7.43). Because $y_{49}^o = -6.2801$, $P(\tilde{s}_{49} = 3 \mid Y_{49}^o, A) \approx 1$, and because $y_{50}^o = 3.1759$, $P(\tilde{s}_{50} = 2 \mid Y_{50}^o, A) \approx 1$. Since $p_{i2}/p_{i3} = 1.5$ for all $i$, the one-step-ahead predictive densities in panels (c) and (d) of Figure 7.2 are nearly identical. In periods $t = 68$ and $t = 69$ the filtered probability of state 1 is not close to either zero or one. The consequence is that $p(y_{t+1} \mid Y_t^o, A)$ is a weighted average of $p(y_{t+1} \mid \tilde{s}_t = 1, A)$ and $p(y_{t+1} \mid \tilde{s}_t = 2 \text{ or } 3, A)$.

Figure 7.1. A hypothetical time series obeying a known Markov mixture of normals model: (a) observed values; (b) observed values squared; (c) state 1 filtered probabilities; (d) state 1 smoothed probabilities; (e) state 2 filtered probabilities; (f) state 2 smoothed probabilities; (g) state 3 filtered probabilities; (h) state 3 smoothed probabilities.

The conditional mean of $y_t$ is always zero, and the conditional skewness is
always negative. Conditional excess kurtosis can be positive (when the filtered 
probability of state 1 is large) or negative (when it is small); unconditionally, it is 
positive. These features are consequences of the specific parameter values in (7.43). 


Figure 7.2. Some conditional densities for the hypothetical Markov mixture of normals time series: (a) unconditional predictive density; (b) $t = 15$, $p = (0.975, 0.018, 0.008)$; (c) $t = 49$, $p = (0.000, 0.001, 0.999)$; (d) $t = 50$, $p = (0.001, 0.939, 0.060)$; (e) $t = 68$, $p = (0.357, 0.544, 0.099)$; (f) $t = 69$, $p = (0.716, 0.168, 0.117)$.


Example 7.3.2 The Markov Mixture Model and Value at Risk (The online appendix contains data, annotated code, and output for this example.) Recall the value at risk decision problem introduced in Section 1.1.2. The price of an asset or portfolio on day $t$ is $p_t$. For a date or dates $t^* > t$, the decisionmaker must state a value at risk $v_{t,t^*}$ such that $P(p_t - p_{t^*} > v_{t,t^*}) = .05$. The probability is, of course, conditional on the information and data available. Letting $r_{t,t^*} = \log(p_{t^*}/p_t)$ denote the return between day $t$ and day $t^*$, an equivalent problem is to find a return at risk $w_{t,t^*}$ such that $P(r_{t,t^*} < -w_{t,t^*}) = .05$.

This example illustrates the process of finding return at risk, conditional on a single series of returns and a Markov normal mixture linear model. The asset is the Standard and Poor's (S&P) 500 stock price index, for the period March 23, 1978 through December 7, 1984, sample size $T = 1700$. This is the 9th of 10 subsamples of a longer series of the S&P 500 index used by Ryden et al. (1998) in an investigation of the ability of Markov mixture models to account for several features of these data. The Markov mixture model has three states, a constant term as its only covariate, and employs the prior distributions (6.31), (6.32), (6.37), (6.38), and (7.32) with


ßB=0, h=106, 5510"; (7.46) 
h, = 1, v, = 3, $r_1 = 10$, and $r_2 = 1$. (7.47)


Example 8.3.2 interprets this prior distribution. For comparison purposes only, we 
also consider a model in which returns are i.i.d. normal, utilizing the prior distri- 
bution introduced in Example 2.1.2 with the settings given in (7.46). 

The posterior simulator for the Markov mixture model ran 22,000 iterations, the first 2000 of which were discarded. The analysis that follows uses every 20th iteration, for a total of 1000 simulated values from the posterior distribution. The alternative normal model posterior simulator ran 1100 iterations, with analysis based on the last 1000. Panel (a) of Figure 7.3 shows the logarithm of the posterior mean of the unconditional probability density function of returns in the Markov mixture model (heavier curve), together with the log posterior mean of the pdf of returns in the i.i.d. normal model (lighter curve). Panel (b), which displays 200 points from the posterior distribution of skewness and kurtosis, provides another perspective on the shape of the unconditional distribution in the Markov mixture model. It is mildly but decisively leptokurtic, displays little or no skewness, and is centered at a very small positive return.

Figure 7.3. Some aspects of the Markov mixture model applied to 1700 daily returns of the S&P 500 index: (a) unconditional pdf; (b) posterior distribution of unconditional moments; log predictive densities on (c) July 2, 1984 and (d) August 3, 1984.

Panels (c) and (d) in Figure 7.3 show the logarithm of the predictive density for 
the next day’s returns, for two particular dates in the sample period. In the Markov 
normal mixture linear model predictive densities (heavier curves) vary from one day 
to the next, as emphasized in Example 7.3.1, whereas in the normal linear model 
these densities (lighter curves) are always the same as the unconditional pdf. On 
July 2 the Markov normal mixture model indicates substantially more uncertainty 
about the next day’s return than does the normal model, and careful inspection of 
the figure in panel (c) shows that the Markov normal mixture predictive distribution 
is also slightly skewed to the left. On August 3 the distribution of the next day’s 
return is more closely aligned in the two models. The predictive density in the 
Markov normal mixture model is distinctly leptokurtic, and careful inspection of 
the figure in panel (d) shows that the distribution is slightly skewed to the right. 

From the posterior simulation output, there is a state assignment $\tilde{s}_T^{(m)}$ for the last date in the sample at iteration $m$. State assignments can be generated recursively for future dates $T + j$, according to $P(\tilde{s}_{T+j} = i \mid \tilde{s}_{T+j-1} = k, P^{(m)}, A) = p_{ki}^{(m)}$. A random sample $y_{T+1}^{(m)}, y_{T+2}^{(m)}, \dots$ can then be generated from these assignments and the other simulated parameters $\gamma^{(m)}$, $h^{(m)}$, and $\mathbf{h}^{(m)}$. The simulated $j$-day-ahead return is then $\sum_{i=1}^j y_{T+i}^{(m)}$. These returns can be simulated several times for each parameter vector drawn from the posterior simulator. Sorting the simulated returns over all iterations and simulations then provides the return at risk. Figure 7.4 was constructed in this way from the 1000 drawings from the posterior distribution and 100 simulations for each drawing.
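
The simulation just described can be sketched for a single posterior draw as follows. The sketch assumes NumPy, codes states $0,\dots,m-1$, and uses hypothetical function names; `alpha` and `prec` stand for the state means and the precisions $h \cdot h_j$ of one posterior draw, with a constant as the only covariate. In a full analysis the simulated returns from all posterior draws would be pooled before taking the quantile.

```python
import numpy as np

def simulate_cumulative_returns(s_T, P, alpha, prec, horizon, n_sims, rng):
    """Simulate cumulative returns y_{T+1} + ... + y_{T+j}, j = 1..horizon,
    starting from the state assignment s_T, for one parameter draw."""
    m = len(alpha)
    cdf = np.cumsum(P, axis=1)
    states = np.full(n_sims, s_T)
    total = np.zeros(n_sims)
    out = np.empty((n_sims, horizon))
    for j in range(horizon):
        u = rng.uniform(size=n_sims)
        # Next state by inverting the cdf of the current state's row of P.
        states = np.minimum((u[:, None] > cdf[states]).sum(axis=1), m - 1)
        total = total + rng.normal(alpha[states], 1.0 / np.sqrt(prec[states]))
        out[:, j] = total
    return out

def return_at_risk(sim_returns, prob=0.05):
    """w such that the simulated frequency of r <= -w is about prob, for each horizon."""
    return -np.quantile(sim_returns, prob, axis=0)
```

To pool over posterior draws, one could call `simulate_cumulative_returns` once per draw and stack the results with `np.vstack` before passing them to `return_at_risk`.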

This figure indicates return at risk using a probability .05 [panels (a) and (b)] as well as .01 [panels (c) and (d)] for total returns up to 10 business days after the dates indicated. The unconditional distribution of returns in the i.i.d. Gaussian model implies that return at risk $j$ days in the future is always the same, depending only on $j$. Consequently the lighter curves in the left and right panels are identical. In the Markov normal mixture linear model return at risk $j$ days in the future is always changing, but as $j \to \infty$, the predictive density for $y_{T+j}$ approaches the one shown in panel (a) of Figure 7.3, for all days $T$ on which predictions are made. This is reflected in returns at risk, which must approach the same value at long horizons regardless of the state probabilities at time $T$. This is evident in Figure 7.4. The one-day-ahead return at risk is higher on July 2 than on August 3, reflecting the predictive densities shown in panels (c) and (d) of Figure 7.3, but the 10-day-ahead return at risk is nearly the same for the 2 days. By implication, return at risk rises more rapidly with lengthening horizon starting from August 3 than it does from July 2.

Figure 7.4. Return at risk [(a) p = 5%, 7/2/84; (b) p = 5%, 8/3/84; (c) p = 1%, 7/2/84; (d) p = 1%, 8/3/84] in the Markov normal mixture model (heavier curve) and i.i.d. normal model (lighter curve).


Exercise 7.3.1 There are a number of important technical details that underlie 
the work in Example 7.3.2 that can be appreciated only by using the code in the 
online appendix. 


(a) From the output of the posterior simulator, look for examples of “label 
switching.” Label switching is indicated by the same, sudden permutation 
of the parameter vectors $\alpha$, $\mathbf{h}$ and of the rows and columns of the matrix $P$
between successive iterations. Note that this has no impact on any function 
of interest that depends only on future values of the time series to which 
the model is being applied. 

(b) Using the methods of Section 4.7, illustrated in Example 5.1.1, test for con- 
vergence of the posterior simulator. 


Exercise 7.3.2 Real decisionmakers are likely to want to assess the sensitivity
of conclusions to assumptions in the model or models used.

(a) Repeat the analysis of Example 7.3.2 using different numbers of states (m).
How sensitive are the findings about return at risk in Figure 7.4 to this 
specification? 

(b) How sensitive are these results to the specification of the hyperparameters 
of the prior distribution (7.46)—(7.47)? (Before answering this question, you 
may wish to consult Examples 8.3.2 and 8.3.4.) 


Exercise 7.3.3 The online appendix contains a much longer set of returns from 
the S&P 500 stock price index, and the code indicates how to reconstruct the 
subsamples used in Ryden et al. (1998). 


(a) Repeat the analysis in Example 7.3.2 for some of the other nine subsamples 
used in Ryden et al. (1998). Are the findings similar for these other periods? 

(b) Repeat the analysis in Example 7.3.2 using the entire sequence of S&P 500 
returns but with a larger number of states. Interpret the results in the context 
of the results of part (a). To what extent is there a tendency for states to occur 
in one part of the sample (e.g., the 1930s or 1980s) and never reappear? 


CHAPTER 8 


Bayesian Investigation 


Multiple stakeholders have competing interests in the decisions that motivate 
Bayesian inference. Investigators, sometimes working on behalf of stakeholder 
clients, will carefully examine formal models used to inform these decisions. If a 
model is taken seriously and is likely to bear on a decision, then its credibility, 
often as indicated by its implications for observables, will receive close scrutiny. 
So, too, will the sensitivity of these implications to changes in the specification of 
the model. It is likely that new models or major variants on existing models will 
be introduced to cope with specific features of the problem at hand, the decision 
being addressed, and their interaction. 

This chapter presents tools at the disposal of the Bayesian investigator party 
to this process. The simulation methods set forth in Chapter 4 are well suited to 
examination of models of the type discussed in Chapters 5 through 7. A seasoned 
Bayesian investigator can go well beyond the models in these chapters, using the 
simulation methods of Chapter 4 and the underlying insights of Chapters 2 and 
3, to construct the variants required for a specific decision problem. This process 
often involves combinations and syntheses of simpler models, such as those taken 
up in Chapters 5 through 7. 

The development of a new model, even if it is an apparently straightforward 
variant on an existing and thoroughly understood model, consumes resources. Good 
investigators must understand the implications of their complete models for observ- 
ables, and must be able to reflect the beliefs of their clients in the specification 
of these models. This chapter addresses this problem, in Section 8.3, by means 
of forward simulation: that is, drawing unobservables from a candidate prior, fol- 
lowed by observables drawn conditional on unobservables, followed by the vector 
of interest drawn conditional on both. Section 8.3 also details how the investigator 
can go some distance in ascertaining whether a model yields sensible implications 
for observables and vectors of interest, before setting up a posterior simulator. 

If a complete model yields sensible prior implications for observables and vec- 
tors of interest, the investigator may proceed to write a posterior simulator, a 
task often more time-consuming than writing a forward simulator. This entails
developing an algorithm like those taken up in Chapters 5 through 7, and executing the
algorithm with suitable computer code. Section 8.1 develops tests that should be 
applied at the end of this process, before taking code to the data and problem at 
hand. These tests have substantial power against logical errors that may arise any- 
where following the specification of the model, including expression of densities 
and conditional densities and their embodiment in computer code. 

Given posterior simulators for alternative models, the investigator may turn to 
formal comparison of these models using Bayes factors and marginal likelihoods 
as discussed in Section 2.6. Section 8.2 describes several ways in which posterior 
simulators can be used to approximate marginal likelihoods or Bayes factors. 

The process of examining the sensitivity of the results that count—the “bottom 
line” of uncertainty about the vector of interest—to the necessary but intermediate 
steps of specifying prior and observables distributions is one that often involves 
both the investigator and the client. Section 8.4 provides methods that enable a 
Bayesian investigator, perhaps working with a complex model, a sophisticated 
posterior simulator, and a very fast computer, to communicate results in a fashion 
that permits clients with spreadsheet software and laptops to manipulate some of 
the specifications of the model and examine the implications for vectors of interest. 
A particular client of interest in this process is the remote client—for example, 
an anonymous reader of the investigator’s published work. Section 8.5 shows how 
the Bayesian investigator can facilitate this process. 


8.1 IMPLEMENTING SIMULATION METHODS 


A complete model $A$ provides a prior density of unobservables $p(\theta_A \mid A)$, a
conditional density of observables $p(y \mid \theta_A, A)$, and a conditional density of a
vector of interest, $p(\omega \mid \theta_A, y, A)$. It is usually straightforward to simulate
from each of these distributions:

$$\theta_A^{(m)} \sim p(\theta_A \mid A), \tag{8.1}$$
$$y^{(m)} \sim p(y \mid \theta_A, A), \tag{8.2}$$
$$\omega^{(m)} \sim p(\omega \mid y, \theta_A, A). \tag{8.3}$$

If $\theta_A = \theta_A^{(m)}$ in (8.2), then $y^{(m)} \sim p(y \mid A)$. If $y = y^{(m)}$ and
$\theta_A = \theta_A^{(m)}$ in (8.3), then $\omega^{(m)} \sim p(\omega \mid A)$.
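The three steps (8.1) through (8.3) can be sketched in a few lines. The model below is purely illustrative and is not taken from the text: the prior ($\mu \sim N(0,1)$, precision $h$ gamma distributed), the sample size, and the choice of one future observation as the vector of interest are all assumptions made only to keep the sketch self-contained.

```python
import math
import random

def forward_simulate(M, T, seed=0):
    """Draw (unobservables, observables, vector of interest) M times."""
    rng = random.Random(seed)
    draws = []
    for _ in range(M):
        # (8.1): unobservables from the prior (hypothetical prior choices)
        mu = rng.gauss(0.0, 1.0)            # mu ~ N(0, 1)
        h = rng.gammavariate(2.0, 0.5)      # precision: shape 2, scale 0.5
        sd = 1.0 / math.sqrt(h)
        # (8.2): observables conditional on the unobservables just drawn
        y = [rng.gauss(mu, sd) for _ in range(T)]
        # (8.3): vector of interest, here one future observation
        omega = rng.gauss(mu, sd)
        draws.append((mu, h, y, omega))
    return draws

draws = forward_simulate(M=5000, T=10)
# One feature of observables implied by the model: the sample range of y
ranges = sorted(max(y) - min(y) for _, _, y, _ in draws)
print("prior predictive median of range(y):", ranges[len(ranges) // 2])
```

Summaries of this kind, computed before any posterior simulator exists, are what allow a prior to be judged by its implications for observables.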

As a matter of research strategy, the relative ease of constructing the simulators
(8.1), (8.2), and (8.3) suggests that this be done before undertaking the more
challenging task of constructing a posterior simulator
$\theta_A^{(m)} \sim p(\theta_A \mid y^o, A)$. The forward simulator can reveal interesting
and relevant properties of the model, in particular its ability to account for
salient features, $\omega$, of observables. It can indicate the suitability of the model
for the purposes at hand, for example, its ability to replicate important features
of the data, as discussed in greater detail in Section 8.3.


8.1.1 Density Ratio Tests 


It is also generally straightforward to express the densities that appear on the
right sides of (8.1)–(8.3), and to write code that evaluates these densities. This
is a requirement, in any event, in order to compute or approximate the marginal
likelihood of the model. There is an intimate relationship between the simulators
(8.1)–(8.3) and the evaluation of the corresponding densities, which is useful in
checking their derivation as well as their expression in computer code.


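Theorem 8.1.1 Density Ratio Test Suppose that $\{x^{(m)}\}$ is an ergodic process
with unique invariant density $p(x \mid I)$ with respect to a measure $\nu$, having support
$X \subseteq \mathbb{R}^n$. Let $k(x \mid I)$ be any kernel of this probability density, and
$c_I = \int_X k(x \mid I)\, d\nu(x)$. Let $f$ be any probability density with respect to $\nu$
having support $X^* \subseteq X$. Then

$$M^{-1}\sum_{m=1}^{M} f(x^{(m)})/k(x^{(m)} \mid I) \xrightarrow{a.s.} c_I^{-1}.$$

Proof: For $g(x) = f(x)/k(x \mid I)$, we have

$$E[g(x) \mid I] = \int_X g(x)\, p(x \mid I)\, d\nu(x) = c_I^{-1}\int_X g(x)\, k(x \mid I)\, d\nu(x)
= c_I^{-1}\int_{X^*} f(x)\, d\nu(x) = c_I^{-1}.$$

Since $c_I^{-1}$ is finite and $\{x^{(m)}\}$ is ergodic,
$M^{-1}\sum_{m=1}^{M} g(x^{(m)}) \xrightarrow{a.s.} c_I^{-1}$. $\blacksquare$

In the particular case $k(x \mid I) = p(x \mid I)$, we obtain

$$M^{-1}\sum_{m=1}^{M} f(x^{(m)})/p(x^{(m)} \mid I) \xrightarrow{a.s.} 1. \tag{8.4}$$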


Theorem 8.1.1 provides a basis for testing simulators and density evaluations. If
both have been derived and coded correctly, then (8.4) must hold. It is, of course,
not the case that (8.4) must be violated if there are errors. However, in most settings
it is difficult to produce errors, even intentional ones, that leave (8.4) intact.
Moreover, the derivation and coding of a simulator is typically independent of the
derivation and coding of the evaluation of $p(x)$; thus it is unlikely that the same
error can enter each in a way that preserves (8.4). Note that “error,” here, subsumes
everything from misconceptions to mistakes in derivations to “bugs” in computer code.

The less variation in $f(x)/p(x \mid I)$, the better is the approximation in Theorem
8.1.1. If $\{x^{(m)}\}$ is uniformly ergodic and $\operatorname{var}[g(x) \mid I] < \infty$, then
Theorem 4.7.1 may be invoked to provide the numerical standard error of the
approximation $M^{-1}\sum_{m=1}^{M} g(x^{(m)})$. The variance requirement amounts to

$$\int_{X^*} [f(x)^2/p(x \mid I)]\, d\nu(x) < \infty$$

and will be satisfied if $f(x)/p(x \mid I)$ is bounded above on $X^*$. There is a large class
of cases in which such a density $f(x)$ may be constructed from the simulations
$x^{(1)}, \ldots, x^{(M)}$.


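Theorem 8.1.2 Constructing Density Ratio Tests Suppose that the $n \times 1$ random
vector $x$ has an absolutely continuous distribution with mean $\mu$, variance $\Sigma$,
and probability density $p$ that is bounded above as well as bounded away from zero
on all compact sets $A \subset \mathbb{R}^n$. Suppose that $\{x^{(m)}\}$ is ergodic with invariant
density $p(x \mid I)$, and denote the sample mean and variance of $x^{(m)}$ $(m = 1, \ldots, M)$
by $\hat\mu^{(M)}$ and $\hat\Sigma^{(M)}$. Denote the pdf of a multivariate normal distribution
with mean $\hat\mu^{(M)}$ and variance $\hat\Sigma^{(M)}$, truncated to its highest density region
$\hat X_\alpha^{(M)}$ of size $100(1-\alpha)\%$, by

$$f_\alpha^{(M)}(x) = (1-\alpha)^{-1}(2\pi)^{-n/2}\,|\hat\Sigma^{(M)}|^{-1/2}
\exp[-(x-\hat\mu^{(M)})'(\hat\Sigma^{(M)})^{-1}(x-\hat\mu^{(M)})/2]\; I_{\hat X_\alpha^{(M)}}(x).$$

Let $k(x \mid I) = c_I \cdot p(x)$, where $c_I > 0$, be a kernel of $p(x \mid I)$. Then

$$M^{-1}\sum_{m=1}^{M} f_\alpha^{(M)}(x^{(m)})/k(x^{(m)} \mid I) \xrightarrow{a.s.} c_I^{-1} \tag{8.5}$$

and

$$\lim_{M\to\infty} \operatorname{var}\left[M^{-1/2}\sum_{m=1}^{M} f_\alpha^{(M)}(x^{(m)})/k(x^{(m)} \mid I)\right] < \infty.$$

Proof: Let

$$X_\alpha = \{x : (x-\mu)'\Sigma^{-1}(x-\mu) \le \chi^2_{1-\alpha}(n)\},$$

where $\chi^2_{1-\alpha}(n)$ denotes the $1-\alpha$ quantile of the $\chi^2(n)$ distribution, and

$$f_\alpha(x) = (1-\alpha)^{-1}(2\pi)^{-n/2}\,|\Sigma|^{-1/2}\exp[-(x-\mu)'\Sigma^{-1}(x-\mu)/2]\; I_{X_\alpha}(x).$$

Given any $\varepsilon > 0$, let

$$X_\varepsilon^{(M)} = \{x : |f_\alpha^{(M)}(x) - f_\alpha(x)| > \varepsilon\}.$$

Since $\hat\mu^{(M)} \xrightarrow{a.s.} \mu$ and $\hat\Sigma^{(M)} \xrightarrow{a.s.} \Sigma$, it follows that
$\int_{X_\varepsilon^{(M)}} dx \xrightarrow{a.s.} 0$. The results follow from the conditions that
$f_\alpha$ and $p$ are bounded above and bounded away from zero on all compact sets. $\blacksquare$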


Since the density $f_\alpha^{(M)}$ is constructed from the simulator output $\{x^{(m)}\}$, the
density ratio test requires no further simulation. The conditions of Theorem 8.1.2
assure $\operatorname{var}[f_\alpha^{(M)}(x)/k(x \mid I) \mid I] < \infty$, and therefore Theorem 4.7.1
may be applied to assess the accuracy of this approximation. When
$k(x \mid I) = p(x \mid I)$, then $c_I = 1$ and we can formally test the correctness of the
simulator that produces $\{x^{(m)}\}$ and the code that evaluates $p(x \mid I)$.

The compact support of $f$ in Theorem 8.1.2, achieved by truncating a multivariate
normal density, is important because it bounds the ratio $f/p$. In many
applications $f/p$ will become unbounded as $\alpha \to 0$, and it is always the case that
the accuracy of the approximation must deteriorate as $\alpha \to 1$. Thus it may be
prudent to conduct a density ratio test with several alternative values of $\alpha$. Other
things the same, the greater the dimension of x, the greater the variation in f/p, 
and for sufficiently high dimensions the procedure may become impractical. This 
is generally not a problem in the density ratio test as applied in this section, where 
the size of x can be controlled as illustrated in Examples 8.1.1 and 8.1.2. How- 
ever, this consideration becomes important in Section 8.2.4 in the application of 
Theorems 8.1.1 and 8.1.2 in approximating the marginal likelihood. 
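A minimal scalar sketch of the construction in Theorem 8.1.2 follows. The target distribution (a standard logistic, simulated by its inverse cdf), the deliberately wrong density, and all function names are assumptions chosen for illustration. With $n = 1$ the highest density region of the fitted normal is simply an interval around the sample mean, and because the draws here are i.i.d. the standard error is the usual one rather than the one from Theorem 4.7.1.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def density_ratio_test(draws, log_p, alpha=0.5):
    """Return (log of the mean of f/p, approximate standard error of that log)."""
    m, s = fmean(draws), stdev(draws)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # HDR is |x - m| <= z * s
    ratios = []
    for x in draws:
        u = (x - m) / s
        if abs(u) <= z:
            # log of the truncated normal density f at x
            log_f = (-math.log(1.0 - alpha) - 0.5 * math.log(2.0 * math.pi)
                     - math.log(s) - 0.5 * u * u)
            ratios.append(math.exp(log_f - log_p(x)))
        else:
            ratios.append(0.0)                    # f is zero outside the HDR
    avg = fmean(ratios)
    return math.log(avg), stdev(ratios) / math.sqrt(len(ratios)) / avg

def simulate_logistic(M, seed=0):
    rng = random.Random(seed)
    out = []
    for _ in range(M):
        u = rng.random()
        out.append(math.log(u / (1.0 - u)))       # inverse-cdf draw
    return out

def log_p_correct(x):
    return -x - 2.0 * math.log1p(math.exp(-x))

def log_p_buggy(x):
    return -x - math.log1p(math.exp(-x))          # deliberate error: square omitted

draws = simulate_logistic(10000)
print("correct density:", density_ratio_test(draws, log_p_correct))
print("buggy density:  ", density_ratio_test(draws, log_p_buggy))
```

Under these assumptions the first log ratio should be within a few standard errors of zero and the second should not. Rerunning with several values of `alpha` follows the advice in the preceding paragraph.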


Example 8.1.1 Application of the Density Ratio Test to an Observables Density
Consider the mean zero, normal first-order autoregression model, with stationarity
imposed: $\rho \in (-1, 1)$, $h > 0$ and

$$y_1 \mid A \sim N[0, h^{-1}(1-\rho^2)^{-1}]; \quad
y_t \mid (y_{t-1}, A) \sim N(\rho y_{t-1}, h^{-1}) \quad (t = 2, \ldots, T). \tag{8.6}$$

(This is a very special case of the linear model with serial correlation discussed in
Section 7.1.) We can simulate observables from this model, given fixed values of
$\rho$ and $h$, without even thinking about the observables density, which is

$$p(y_1, \ldots, y_T \mid h, \rho, A) = (2\pi)^{-T/2} h^{T/2} (1-\rho^2)^{1/2}
\exp\left\{-h\left[(1-\rho^2)y_1^2 + \sum_{t=2}^{T}(y_t - \rho y_{t-1})^2\right]/2\right\}. \tag{8.7}$$

To apply the density ratio test, construct $f_\alpha^{(M)}(\cdot)$ as described in Theorem 8.1.2,
with $\alpha = 0.5$ and $M = 10{,}000$ simulations.

Errors may be made either in simulating $y^{(m)}$ or in the evaluation
$p(y \mid h, \rho, A)$. As an alternative to correct simulation, suppose $y_1 \sim N(0, h^{-1})$
in lieu of $N[0, h^{-1}(1-\rho^2)^{-1}]$ in (8.6). As an alternative to the correct evaluation
of the probability density, consider omission of the term $(1-\rho^2)^{1/2}$ (error 1) or
$y_1^2(1-\rho^2)$ (error 2) in (8.7).
The outcomes of the density ratio tests are

Log $[M^{-1}\sum_{m=1}^{M} f(y^{(m)})/p(y^{(m)} \mid h, \rho, A)]$ Shown;
Standard Errors in Parentheses

                     Density evaluation error
Simulation error     None            Error 1         Error 2
None                 −.006 (.010)    .508 (.010)     .259 (.011)
Error                −.342 (.011)    .194 (.011)     −.271 (.011)


The tests easily detect the errors, and could have done so with M = 100 itera- 
tions rather than M = 10,000. The tests also appropriately indicate no error when 
both simulator and data density evaluation are correct. These tests were conducted 
using the single setting of parameters $\rho = 0.8$, $h = 1$. We might, of course, carry
out tests for several alternative settings of parameters. The latter alternative extends 
density ratio tests to the case in which the normalizing constant of the likelihood 
function has been omitted, which often occurs in maximum likelihood estimation. 
Then the limit in (8.4) is not 1.0, but should be the same for different settings of 
the parameters. Testing this hypothesis is straightforward. 
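The structure of this example can be sketched as follows, under a simplifying assumption that is not in the text: $T = 2$, so that the highest density region of the fitted bivariate normal has a closed-form cutoff ($-2\log\alpha$ is the $1-\alpha$ quantile of $\chi^2(2)$). The function names and flags are hypothetical, and the sketch makes no attempt to reproduce the figures in the table above.

```python
import math
import random
from statistics import fmean, stdev

def simulate(M, rho, h, rng, sim_error=False):
    draws = []
    for _ in range(M):
        # sim_error draws y1 from N(0, 1/h) instead of the stationary distribution
        var1 = 1.0 / h if sim_error else 1.0 / (h * (1.0 - rho ** 2))
        y1 = rng.gauss(0.0, math.sqrt(var1))
        y2 = rng.gauss(rho * y1, math.sqrt(1.0 / h))
        draws.append((y1, y2))
    return draws

def log_density(y1, y2, rho, h, dens_error=False):
    # (8.7) with T = 2; dens_error drops the factor (1 - rho^2)^(1/2)
    norm_term = 0.0 if dens_error else 0.5 * math.log(1.0 - rho ** 2)
    quad = (1.0 - rho ** 2) * y1 ** 2 + (y2 - rho * y1) ** 2
    return -math.log(2.0 * math.pi) + math.log(h) + norm_term - 0.5 * h * quad

def ar1_ratio_test(rho=0.8, h=1.0, M=10000, alpha=0.5, seed=0,
                   sim_error=False, dens_error=False):
    rng = random.Random(seed)
    d = simulate(M, rho, h, rng, sim_error)
    m1, m2 = fmean([a for a, _ in d]), fmean([b for _, b in d])
    s11 = fmean([(a - m1) ** 2 for a, _ in d])
    s22 = fmean([(b - m2) ** 2 for _, b in d])
    s12 = fmean([(a - m1) * (b - m2) for a, b in d])
    det = s11 * s22 - s12 ** 2
    cutoff = -2.0 * math.log(alpha)   # (1 - alpha) quantile of chi-square(2)
    ratios = []
    for a, b in d:
        d1, d2 = a - m1, b - m2
        q = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
        if q <= cutoff:
            log_f = (-math.log(1.0 - alpha) - math.log(2.0 * math.pi)
                     - 0.5 * math.log(det) - 0.5 * q)
            ratios.append(math.exp(log_f - log_density(a, b, rho, h, dens_error)))
        else:
            ratios.append(0.0)
    avg = fmean(ratios)
    return math.log(avg), stdev(ratios) / math.sqrt(M) / avg

print("no error:        ", ar1_ratio_test())
print("density error:   ", ar1_ratio_test(dens_error=True))
print("simulation error:", ar1_ratio_test(sim_error=True))
```

Calling `ar1_ratio_test` over a grid of `rho` and `h` values gives the multi-setting check described in the paragraph above.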

In many instances the support of $p(x \mid I)$ is a proper subset of $\mathbb{R}^n$. Theorem 8.1.2 may
still be applied, if the random vector $x$ is transformed so that its support is $\mathbb{R}^n$. The
density of the transformed vector involves the Jacobian of transformation, which 
must also be derived and coded correctly. 


Example 8.1.2 Application of the Density Ratio Test to a Prior Density Consider
the Wishart distribution of an $m \times m$ positive definite matrix $A$, with $m \times m$
positive definite matrix parameter $\Sigma$ and degrees of freedom parameter $\nu > m$,
denoted $A \sim W(\Sigma, \nu)$. Section 5.2 introduced the Wishart as a conditionally
conjugate prior distribution in the seemingly unrelated regressions model, and provided
its pdf in (5.17) followed by an algorithm for i.i.d. simulations $A^{(m)} \sim W(\Sigma, \nu)$.

There are $m(m+1)/2$ distinct elements of $A$. Since $A$ is positive definite, the
support of $A$ is not $\mathbb{R}^{m(m+1)/2}$, and the distribution of these elements is seldom
well approximated by a normal distribution unless $\nu$ is quite large. The density
ratio test is more efficient if applied to a transformation of $A$. Let $A = CC'$ be the
Choleski factorization of $A$, in which $C$ is lower triangular with positive diagonal
elements, and use logarithms of the diagonal elements of $C$ rather than the values
themselves. The Jacobian of transformation for this new set of random variables,
which we denote $\theta$, is $2^m \prod_{i=1}^{m} c_{ii}^{m-i+2}$.

To illustrate the application of the density ratio test to this distribution, take
$m = 4$, $\nu = 10$, and

$$\Sigma = \begin{pmatrix} 4 & -1 & 2 & 0 \\ -1 & 3 & -2 & 1 \\ 2 & -2 & 6 & -1 \\ 0 & 1 & -1 & 3 \end{pmatrix}.$$

In one case the simulation is carried out correctly, whereas in the other
$b_{ii}^2 \sim \chi^2(\nu - i)$ in the second step of the Wishart simulation algorithm presented in
Section 5.2. Consider two errors in evaluating the density: (1) the exponent
$\nu - 1 - m$ appearing in (5.17) is replaced by $\nu - m$ and (2) $\pi^{m(m-1)/4}$ in (5.17) is
omitted. The outcomes of the density ratio tests using $M = 10{,}000$ simulations
and $\alpha = 0.5$ in (8.5) are


Log $[M^{-1}\sum_{m=1}^{M} f(\theta^{(m)})/p(\theta^{(m)})]$ Shown;
Standard Errors in Parentheses

                     Density evaluation error
Simulation error     None            Error 1          Error 2
None                 −.003 (.011)    6.314 (.014)     3.423 (.010)
Error                −.240 (.014)    5.543 (.023)     3.185 (.013)


As in the previous example, the results are clear. 


8.1.2 Joint Distribution Tests 


If there are alternative ergodic simulators $\{x^{(m)}\}$ and $\{\tilde x^{(m)}\}$ with the same
invariant distribution, then

$$M^{-1}\sum_{m=1}^{M} g(x^{(m)}) - M^{-1}\sum_{m=1}^{M} g(\tilde x^{(m)}) \xrightarrow{a.s.} 0$$

as long as $E[g(x) \mid I]$ is well defined and finite. If $\operatorname{var}[g(x) \mid I]$ is also well
defined and finite, and the two simulators are independent, then a central limit
theorem may be applied to test formally the proposition that the invariant
distributions are in fact the same.
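As a small illustration of this comparison, the sketch below draws from a $\chi^2(3)$ distribution in two independent ways and applies a conventional two-sample $z$ test to the first and second moments. The distribution, the two algorithms, and the injected bug (a rate passed where Python's `random.gammavariate` expects a scale) are all assumptions made for the example. Both samples are i.i.d., so no correction for serial correlation is needed here.

```python
import math
import random
from statistics import fmean, variance

def chi2_3_from_normals(rng):
    # Sum of three squared standard normals
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))

def chi2_3_from_gamma(rng, bug=False):
    # chi-square(3) is gamma with shape 1.5 and scale 2
    scale = 0.5 if bug else 2.0     # bug: rate supplied where a scale is expected
    return rng.gammavariate(1.5, scale)

def z_statistics(xs, ws):
    """z statistics for equality of E[g] across two independent i.i.d. samples."""
    out = []
    for g in (lambda v: v, lambda v: v * v):
        gx, gw = [g(v) for v in xs], [g(v) for v in ws]
        se = math.sqrt(variance(gx) / len(gx) + variance(gw) / len(gw))
        out.append((fmean(gx) - fmean(gw)) / se)
    return out

rng = random.Random(0)
M = 20000
xs = [chi2_3_from_normals(rng) for _ in range(M)]
ok = [chi2_3_from_gamma(rng) for _ in range(M)]
bad = [chi2_3_from_gamma(rng, bug=True) for _ in range(M)]
print("same distribution:", z_statistics(xs, ok))
print("with the bug:     ", z_statistics(xs, bad))
```

No density is evaluated anywhere in the sketch, which is the advantage of the joint distribution test noted next.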

This test does not require evaluation of the density $p(x \mid I)$. This is an
advantage if we wish only to check the simulator, since it spares the effort of deriving
and coding $p(x \mid I)$, as well as transformation of $x$ to ensure support in all of
$\mathbb{R}^n$ if that is necessary. However, as we shall see in Section 8.2, it is necessary to
evaluate the prior density $p(\theta_A \mid A)$ and the data density $p(y^o \mid \theta_A, A)$ to
approximate marginal likelihoods. Eventually, therefore, it is necessary to derive, code,
and check the evaluation of the prior and data densities. Density ratio tests, but not
joint distribution tests, are applicable at that stage.

An important use of joint distribution tests is in checking the posterior simulator.
The joint distribution of observables and unobservables is

$$p(\theta_A, y \mid A) = p(\theta_A \mid A)\, p(y \mid \theta_A, A).$$

We have already noted, below (8.3), that simulation from this joint distribution can
be achieved by means of a marginal-conditional simulator

$$\theta_A^{(m)} \sim p(\theta_A \mid A), \quad y^{(m)} \sim p(y \mid \theta_A^{(m)}, A) \quad (m = 1, \ldots, M). \tag{8.8}$$

For any function $g(\theta_A, y)$ whose expectation with respect to the joint distribution
of $\theta_A$ and $y$ is well defined and finite, we obtain

$$M^{-1}\sum_{m=1}^{M} g(\theta_A^{(m)}, y^{(m)}) \xrightarrow{a.s.}
\iint g(\theta_A, y)\, p(\theta_A, y \mid A)\, d\nu(y)\, d\nu(\theta_A). \tag{8.9}$$

If $\operatorname{var}[g(\theta_A, y) \mid A] < \infty$, then the Lindeberg–Lévy central limit theorem can be
used to assess the accuracy of the approximation
$M^{-1}\sum_{m=1}^{M} g(\theta_A^{(m)}, y^{(m)})$ in (8.9), since the sequence
$\{\theta_A^{(m)}, y^{(m)}\}$ is i.i.d. If $g$ is a function of $\theta_A$ but not of $y$,
only the prior simulator is required.

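A posterior simulator generates the sequence $\{\tilde\theta_A^{(m)}\}$ from a Markov chain $C$:

$$\tilde\theta_A^{(m)} \sim p(\theta_A \mid \tilde\theta_A^{(m-1)}, y^o, C).$$

Consider the successive conditional simulator

$$\tilde\theta_A^{(0)} \sim p(\theta_A \mid A);$$
$$\tilde y^{(m)} \sim p(y \mid \tilde\theta_A^{(m-1)}, A), \quad
\tilde\theta_A^{(m)} \sim p(\theta_A \mid \tilde\theta_A^{(m-1)}, \tilde y^{(m)}, C)
\quad (m = 1, \ldots, M). \tag{8.10}$$

If we have at hand a demonstration of the (uniform) ergodicity of
$\{\tilde\theta_A^{(m)}\}$ for almost all $y^o$, then showing that
$\{\tilde\theta_A^{(m)}, \tilde y^{(m)}\}$ is (uniformly) ergodic with unique invariant
density $p(\theta_A \mid A)\, p(y \mid \theta_A, A)$ typically involves little, if any, additional
work. In this case

$$M^{-1}\sum_{m=1}^{M} g(\tilde\theta_A^{(m)}, \tilde y^{(m)}) \xrightarrow{a.s.}
\iint g(\theta_A, y)\, p(\theta_A, y \mid A)\, d\nu(y)\, d\nu(\theta_A). \tag{8.11}$$

The standard error of approximation in (8.11) can be assessed using Theorems
4.7.1 and 4.7.3.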

The thought processes and coding in the simulations of $\{\theta_A^{(m)}, y^{(m)}\}$ and
$\{\tilde\theta_A^{(m)}, \tilde y^{(m)}\}$ are nearly independent. The former involves the prior
and observables simulators, and the latter involves the posterior and observables
simulators. The observables simulator is common to both, but an error in this simulator
will have different consequences for the invariant distributions in the two cases.
Consequently a formal comparison of the left sides of (8.9) and (8.11) has power
against error in the simulation of observables, as well as error in the simulation of
unobservables from the prior or posterior. The simulator $\{\theta_A^{(m)}, y^{(m)}\}$ is a logical
first step in understanding the properties of a new model and in developing prior
distributions in any event, as detailed in Section 8.3.1. The marginal effort in
producing $\{\tilde\theta_A^{(m)}, \tilde y^{(m)}\}$ is the addition of a few lines to generate
artificial data in posterior simulation code. Thus the cost of making the comparison
is relatively low. The gain is that subtle errors producing reasonable but incorrect
results will likely be detected.
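The comparison of (8.9) with (8.11) can be sketched for a toy model, chosen here only because everything is available in closed form: $\mu \sim N(0,1)$ and $y_t \mid \mu \sim N(\mu, 1)$ for $t = 1, \ldots, T$. The model, the value $T = 3$, the injected bug, and the batch-means standard error (a simple stand-in for the methods of Theorems 4.7.1 and 4.7.3) are all assumptions of this sketch.

```python
import math
import random
from statistics import fmean, variance

T = 3  # hypothetical sample size

def marginal_conditional(M, rng):
    # (8.8); y is not generated because g depends on mu only
    return [rng.gauss(0.0, 1.0) for _ in range(M)]

def successive_conditional(M, rng, bug=False):
    mu = rng.gauss(0.0, 1.0)                          # initial draw from the prior
    out = []
    for _ in range(M):
        y = [rng.gauss(mu, 1.0) for _ in range(T)]    # artificial data given current mu
        post_var = 1.0 / T if bug else 1.0 / (T + 1)  # bug: prior precision omitted
        mu = rng.gauss(sum(y) / (T + 1), math.sqrt(post_var))  # posterior transition
        out.append(mu)
    return out

def mean_and_nse(values, batches=50):
    """Sample mean with a batch-means numerical standard error."""
    size = len(values) // batches
    means = [fmean(values[b * size:(b + 1) * size]) for b in range(batches)]
    return fmean(means), math.sqrt(variance(means) / batches)

def joint_test(bug, M=40000, seed=0):
    rng = random.Random(seed)
    a = marginal_conditional(M, rng)
    b = successive_conditional(M, rng, bug)
    zs = []
    for g in (lambda t: t, lambda t: t * t):          # first and second moments
        ma, sa = mean_and_nse([g(t) for t in a])
        mb, sb = mean_and_nse([g(t) for t in b])
        zs.append((ma - mb) / math.hypot(sa, sb))
    return zs

print("correct posterior simulator:", joint_test(bug=False))
print("buggy posterior simulator:  ", joint_test(bug=True))
```

The bug leaves the posterior mean untouched and inflates only the posterior variance, the kind of error that yields plausible output; in this sketch it shows up in the second-moment statistic.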


Example 8.1.3 Joint Distribution Tests of a Posterior Simulator [This
example appears in Geweke (2004).] To demonstrate the kinds of errors that these
tests can detect in a research (as opposed to illustrative) problem, consider the
Student-t mixture model

$$y_t \sim t(\mu_1, h_1; \lambda) \text{ with probability } p, \tag{8.12}$$
$$y_t \sim t(\mu_2, h_2; \lambda) \text{ with probability } 1 - p. \tag{8.13}$$

In this problem $\lambda$ is fixed at $\lambda = 5$, but the model could be extended to make $\lambda$ an
unknown parameter as described in Section 6.4.1. This model, or one like it, can
be used to model outliers (see Exercise 6.4.5), and models similar to this one can
be used in financial applications similar to the financial decisionmaking problem
described in Section 1.1.2, as illustrated in Example 7.3.2.

As discussed in Section 6.4.1, we can exploit the fact that the sequence
$\lambda\tilde h_t \sim \chi^2(\lambda)$ followed by $y_t \sim N[\mu_1, (h_1\tilde h_t)^{-1}]$ is equivalent to (8.12).
To recapitulate briefly, the model is augmented with $(\tilde h_1, \ldots, \tilde h_T)$ and the
latent state vector $(\tilde s_1, \ldots, \tilde s_T)$, with $\tilde s_t = 1$ indicating (8.12) and
$\tilde s_t = 2$ indicating (8.13). Then normal priors for $\mu_1$ and $\mu_2$, gamma priors for
$h_1$ and $h_2$, and a beta prior for $p$ are all conditionally conjugate, and the resulting
conditional distributions in a Gibbs sampling algorithm are also of these forms.

This example uses two variants of the Gibbs sampler. In the first (MCMC1),
$\tilde s_t$ and $\tilde h_t$ are drawn jointly; in the second (MCMC2), they are drawn separately.
In the simulator $\{\theta_A^{(m)}, y^{(m)}\}$ the five parameters are drawn from the prior, and
then $T = 6$ observations are generated. The simulator
$\{\tilde\theta_A^{(m)}, \tilde y^{(m)}\}$ follows (8.10), again using $T = 6$ observations. The joint
distribution test is carried out using the 5 first and 15 second moments of the
parameter vector $\theta_A = (\mu_1, \mu_2, h_1, h_2, p)'$. [Since $y$ is not involved in the
comparison, it is really not necessary to generate $y^{(m)}$ in the marginal-conditional
simulator.]

To gather evidence on the power of the posterior simulator joint distribution
test, we introduce some errors. The first is an error in simulating from the prior: $p$
is drawn from a beta(1, 1) distribution in the prior, whereas the posterior employs a
beta(2, 2) prior density. The second is an error in simulating the observables in the
successive conditional simulator; the simulator ignores the $\tilde h_t$ from the posterior
simulator, and uses instead fresh values to construct $y_t$. The third error is in the
simulation of $\mu_1$ and $\mu_2$ in the Gibbs sampler; they are set equal to their conditional
means, with no allowance for their conditional variance. In the fourth error the
degrees of freedom in the draw of $\tilde h_t$ is 5, rather than its correct value of 6 in the
conditional posterior distribution. The final error is in the generation of
$(\tilde s_t, \tilde h_t)$ in MCMC1. The correct algorithm generates $\tilde s_t$ (conditional on all
unknowns except $\tilde h_t$) and then generates $\tilde h_t$ conditional on all unknowns
including the $\tilde s_t$ just drawn. In the error, $\tilde h_t$ is drawn several steps later in the
Gibbs sampling algorithm rather than immediately after $\tilde s_t$.

For each of seven simulators, two correct and five incorrect, $M = 250{,}000$ values
are drawn from $p(\theta_A, y \mid A)$ using both the marginal-conditional simulator (8.8)
and the successive conditional simulator (8.10). In each case, the 20 moments are
computed for each simulator, and a conventional equality of means test is applied,
taking care to account for serial correlation in the successive conditional
simulator (Theorems 4.7.1 and 4.7.3). Tests carried out at some alternative conventional
significance levels produce the following results:

                                                Rejections (out of 20) at
Algorithm   Error                                 p=.05   p=.01   p=.005   p=.001
MCMC1       0. None                               0       0       0        0
MCMC2       0. None                               0       0       0        0
MCMC1       1. Prior simulation of p              4       3       3        2
MCMC1       2. Simulation of y                    10      9       9        9
MCMC1       3. $\tilde h_t$ degrees of freedom          5       3       3        3
MCMC1       4. $\mu$ variance                        11      10      10       9
MCMC1       5. $(\tilde s_t, \tilde h_t)$ draw              7       6       6        6


The correct algorithm clearly passes the joint distribution tests, whereas errors—in 
the prior, observables, or posterior simulators—are all flagged. 


Exercise 8.1.1 More Applications of Joint Distribution Tests In the course of
her work, an investigator has created a new prior density $p(\theta_A \mid A)$ for a particular
scalar parameter $\theta_A$. She has developed an algorithm for direct sampling from this
distribution, and has written the corresponding software.

(a) Suppose that the investigator also has developed and coded an algorithm
that evaluates the cdf of this distribution. How might she conduct a joint
distribution test of the correctness of both her cdf evaluation and her direct
sampling algorithm? (Hint: She can also simulate directly from a uniform
distribution.)

(b) Suppose instead that the investigator also has developed and coded an
algorithm that evaluates the inverse cdf of the distribution in question. How
might she conduct a joint distribution test of the correctness of both her
inverse cdf evaluation and her direct sampling algorithm?


Exercise 8.1.2 Density Ratio Test for Importance Sampling Theorem 8.1.1 
assumes that the simulation algorithm is a Markov chain. Suppose, instead, that 
the algorithm uses importance sampling. State and prove a variant of Theorem
8.1.1 appropriate to this situation. 


8.2 FORMAL MODEL COMPARISON 


Model comparison can entail model averaging or hypothesis testing, as discussed in
Section 2.6.1. In either case, given a set of models $A_1, \ldots, A_J$, the essential
computational task is to approximate the marginal likelihoods
$p(y^o \mid A_j)$ $(j = 1, \ldots, J)$.
Analytical evaluations are possible only in very special cases, generally requiring
fully conjugate prior distributions; instances include the linear model (Examples
2.3.2 and 2.3.3) and the first-order Markov finite state model (Section 7.2).

In most models, we must compute a good approximation to the marginal likeli- 
hood. A key difficulty is that the marginal likelihood cannot be expressed directly 
as a posterior moment, and consequently the problem cannot be treated directly as a 
special case of the simulation-consistent approximation of posterior moments devel- 
oped in Chapter 4. There are specific cases in which the approximation of a Bayes 
factor is simply a special case of the approximation of a posterior moment. One of 
these will be important subsequently in Bayesian communication (Section 8.4) and 
robustness analysis (Section 8.5), and so we develop it here (Section 8.2.1). In the 
more general case there are methods specifically tailored to the kind of simulator 
used; here we examine the cases of importance sampling (Section 8.2.2) and Gibbs 
sampling (Section 8.2.3). 

The computation or approximation of Bayes factors is an important current 
research topic, and this section does not include all approaches to this prob- 
lem. In particular, Tierney and Kadane (1986, 1989) developed an approximation 
of the marginal likelihood based on Laplace expansions, and Green (1995) has 
developed simulation methods that treat several models simultaneously, with the 
number of drawings from each model proportional to its posterior probability. 
These approaches have proved very effective in some applications, and less so in 
others. The surveys of Carlin and Chib (1995) and Han and Carlin (2001) provide 
accessible introductions to these and other approaches to formal model comparison 
not discussed in detail here. 


8.2.1 Bayes Factors for Modeling with Common Likelihoods 


Suppose that the models $A_1$ and $A_2$ share the same conditional probability density
of observables, $p(y \mid \theta_A, A)$, but have different prior densities,
$p(\theta_A \mid A_1)$ and $p(\theta_A \mid A_2)$. The Bayes factor in favor of $A_2$ is

$$\frac{\int_{\Theta_{A_2}} p(\theta_A \mid A_2)\, p(y^o \mid \theta_A, A)\, d\nu(\theta_A)}
{\int_{\Theta_{A_1}} p(\theta_A \mid A_1)\, p(y^o \mid \theta_A, A)\, d\nu(\theta_A)}. \tag{8.14}$$

If $\Theta_{A_2} \subseteq \Theta_{A_1}$, then (8.14) may be expressed

$$\frac{\int_{\Theta_{A_1}} [p(\theta_A \mid A_2)/p(\theta_A \mid A_1)]\, p(\theta_A \mid A_1)\,
p(y^o \mid \theta_A, A)\, d\nu(\theta_A)}
{\int_{\Theta_{A_1}} p(\theta_A \mid A_1)\, p(y^o \mid \theta_A, A)\, d\nu(\theta_A)}
= E[g(\theta_A) \mid y^o, A_1] \tag{8.15}$$

with $g(\theta_A) = p(\theta_A \mid A_2)/p(\theta_A \mid A_1)$.
The posterior moment in (8.15) is well defined and finite. Given a posterior
simulator $\theta_A^{(m)} \sim p(\theta_A \mid y^o, A_1)$, it may be approximated consistently by

$$M^{-1}\sum_{m=1}^{M} p(\theta_A^{(m)} \mid A_2)/p(\theta_A^{(m)} \mid A_1).$$

If the posterior variance of $g(\theta_A)$ is finite and the simulator is uniformly ergodic,
then Theorem 4.7.1 can be the basis for evaluating the numerical accuracy of the
approximation. A sufficient condition for finite variance of $g(\theta_A)$ is that the ratio
of prior densities $p(\theta_A \mid A_2)/p(\theta_A \mid A_1)$ be bounded on $\Theta_{A_1}$.


Example 8.2.1 Changing the Prior in the Normal Linear Regression Model
Consider a normal linear regression model with independent priors
$\beta \mid A_1 \sim N(\underline\beta, \underline H^{-1})$ and $\underline s^2 h \mid A_1 \sim \chi^2(\underline\nu)$.
Denote the corresponding prior pdf by $p(\beta, h \mid A_1)$. Suppose that the Gibbs sampler
has been used as a posterior simulator as described in Example 4.3.1 and the simulated
values $\{\beta^{(m)}, h^{(m)}\}$ are available for the approximation of posterior moments. An
investigator or client entertaining a different prior density $p(\beta, h \mid A_2)$ can
approximate the Bayes factor in favor of their model by

$$M^{-1}\sum_{m=1}^{M} p(\beta^{(m)}, h^{(m)} \mid A_2)/p(\beta^{(m)}, h^{(m)} \mid A_1). \tag{8.16}$$

The approximation is simulation consistent. If $p(\beta, h \mid A_2)/p(\beta, h \mid A_1)$ is bounded
above then the numerical accuracy of the approximation can be evaluated using a
central limit theorem. For a specific example, see Exercise 8.2.2.
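The approximation in (8.15) can be checked in a setting where the Bayes factor is known exactly. The sketch below does not use the regression model of this example; it assumes a simpler model, $y_t \mid \mu \sim N(\mu, 1)$ with priors $\mu \sim N(0, v_1)$ under $A_1$ and $\mu \sim N(0, v_2)$ under $A_2$, along with made-up data. Taking $v_2 < v_1$ keeps the prior ratio bounded.

```python
import math
import random
from statistics import fmean

y = [0.5, 1.2, -0.3, 0.8, 1.0]     # hypothetical data, known unit variance
T, sum_y, sum_y2 = len(y), sum(y), sum(v * v for v in y)

def log_normal_pdf(x, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * x * x / var

def log_marginal_likelihood(v):
    """Exact log marginal likelihood when mu ~ N(0, v) and y_t | mu ~ N(mu, 1)."""
    return (-0.5 * T * math.log(2.0 * math.pi) - 0.5 * math.log(1.0 + T * v)
            - 0.5 * (sum_y2 - v * sum_y ** 2 / (1.0 + T * v)))

def bayes_factor_estimate(v1, v2, M=20000, seed=0):
    """Average the prior ratio p(mu | A2)/p(mu | A1) over posterior draws under A1."""
    rng = random.Random(seed)
    post_prec = T + 1.0 / v1
    post_mean, post_sd = sum_y / post_prec, 1.0 / math.sqrt(post_prec)
    draws = [rng.gauss(post_mean, post_sd) for _ in range(M)]   # posterior under A1
    return fmean([math.exp(log_normal_pdf(m, v2) - log_normal_pdf(m, v1))
                  for m in draws])

estimate = bayes_factor_estimate(v1=4.0, v2=1.0)
exact = math.exp(log_marginal_likelihood(1.0) - log_marginal_likelihood(4.0))
print("simulation estimate:", estimate, " exact:", exact)
```

Only the stored posterior draws and the two prior densities are needed, which is why a client can carry out this calculation without rerunning the posterior simulator.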


8.2.2 Marginal Likelihood Approximation Using Importance Sampling 


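Suppose that $p(\theta_A \mid S)$, with support $\Theta_A$, is the probability density function (not
just a kernel) of an importance sampling distribution for the posterior density
$p(\theta_A \mid y^o, A) \propto p(\theta_A \mid A)\, p(y^o \mid \theta_A, A)$. Denote the corresponding weight
function $w(\theta_A) = p(\theta_A \mid A)\, p(y^o \mid \theta_A, A)/p(\theta_A \mid S)$, as in Section 4.2.2. Then

$$\bar w^{(M)} = M^{-1}\sum_{m=1}^{M} w(\theta_A^{(m)}) \xrightarrow{a.s.}
\int_{\Theta_A} w(\theta_A)\, p(\theta_A \mid S)\, d\nu(\theta_A)
= \int_{\Theta_A} p(\theta_A \mid A)\, p(y^o \mid \theta_A, A)\, d\nu(\theta_A) = p(y^o \mid A) = \bar w,$$

the marginal likelihood for model $A$. If $w(\theta_A)$ is bounded, then

$$M^{1/2}(\bar w^{(M)} - \bar w) \xrightarrow{d} N(0, \tau^2), \qquad
M^{-1}\sum_{m=1}^{M} [w(\theta_A^{(m)}) - \bar w^{(M)}]^2 \xrightarrow{a.s.} \tau^2.$$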


The first published application of this idea appears to be Geweke (1989b); see also 
Gelfand and Dey (1994) and Raftery (1996). 

With slight modification this method may be applied with acceptance sampling
or an independence Metropolis chain as well. In the former case, for all candidates
$\theta_A^*$, keep a running sum of
$w(\theta_A^*) = p(\theta_A^* \mid A)\, p(y^o \mid \theta_A^*, A)/p(\theta_A^* \mid S)$, and take
$\bar w^{(M)}$ to be this sum deflated by the total number of candidates drawn. For an
independence Metropolis chain the procedure is identical, except that $q(\theta_A^* \mid H)$
replaces $p(\theta_A^* \mid S)$. At each iteration the running sum is incremented by
$w(\theta_A^*)$, $\theta_A^*$ being the candidate.

For all of these algorithms, the only incremental effort beyond what is otherwise
required is that $p(\theta_A \mid S)$ must be normalized to a density (not just a kernel), and
the posterior density kernel must be expressed in standard form [recall (2.8)].
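A minimal sketch of the importance sampling approximation follows, again for an assumed toy model with a known answer: $y_t \mid \mu \sim N(\mu, 1)$, $\mu \sim N(0, 4)$, and made-up data. The source density is a normal centered at the sample mean with a standard deviation (0.6) chosen to exceed the posterior standard deviation, so that the weight function is bounded. All names and numerical settings are hypothetical.

```python
import math
import random
from statistics import fmean, stdev

y = [0.5, 1.2, -0.3, 0.8, 1.0]     # hypothetical data; y_t | mu ~ N(mu, 1)
T, sum_y, sum_y2 = len(y), sum(y), sum(v * v for v in y)
V = 4.0                            # prior: mu ~ N(0, V)

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_kernel(mu):
    """log of prior density times data density: the kernel in standard form."""
    return log_normal_pdf(mu, 0.0, V) + sum(log_normal_pdf(v, mu, 1.0) for v in y)

def exact_log_ml():
    return (-0.5 * T * math.log(2.0 * math.pi) - 0.5 * math.log(1.0 + T * V)
            - 0.5 * (sum_y2 - V * sum_y ** 2 / (1.0 + T * V)))

def importance_sampling_ml(M=20000, seed=0):
    rng = random.Random(seed)
    src_mean, src_sd = sum_y / T, 0.6    # normalized source density, wider than posterior
    w = []
    for _ in range(M):
        mu = rng.gauss(src_mean, src_sd)
        w.append(math.exp(log_kernel(mu) - log_normal_pdf(mu, src_mean, src_sd ** 2)))
    w_bar = fmean(w)                     # approximates the marginal likelihood
    return w_bar, stdev(w) / math.sqrt(M)

w_bar, nse = importance_sampling_ml()
print("log marginal likelihood:", math.log(w_bar), " exact:", exact_log_ml())
print("numerical standard error of w_bar:", nse)
```

Note that both the source density and the kernel carry their full normalizing constants; dropping either would rescale `w_bar` by an unknown factor.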


8.2.3 Marginal Likelihood Approximation Using Gibbs Sampling 


In the case of the Gibbs sampler, an entirely different procedure due to Chib (1995)
can provide quite accurate evaluations of the marginal likelihood, at the cost of
additional simulations. Given the blocking
$\theta_A' = (\theta_{A(1)}', \ldots, \theta_{A(B)}')$, suppose that the conditional probability
density functions $p(\theta_{A(b)} \mid \theta_{A-(b)}, y^o, A)$ can be evaluated in closed form for
all blocks $b$. [This latter requirement is generally satisfied for a pure Gibbs
sampler. For an extension of this method to the Metropolis–Hastings algorithm, see
Chib and Jeliazkov (2001).]

From the identity $p(\theta_A \mid A)\, p(y^o \mid \theta_A, A) = p(y^o \mid A)\, p(\theta_A \mid y^o, A)$, we have

$$p(y^o \mid A) = p(\theta_A^* \mid A)\, p(y^o \mid \theta_A^*, A)/p(\theta_A^* \mid y^o, A) \tag{8.17}$$

for any fixed $\theta_A^* \in \Theta_A$. Typically $p(\theta_A^* \mid A)$ and $p(y^o \mid \theta_A^*, A)$ can be
evaluated in closed form but $p(\theta_A^* \mid y^o, A)$ cannot. A marginal-conditional
decomposition of $p(\theta_A^* \mid y^o, A)$ is

$$p(\theta_A^* \mid y^o, A) = p(\theta_{A(1)}^* \mid y^o, A)\, p(\theta_{A(2)}^* \mid \theta_{A(1)}^*, y^o, A)
\cdots p(\theta_{A(B)}^* \mid \theta_{A<(B)}^*, y^o, A). \tag{8.18}$$

The first term in the product of $B$ terms on the right side of (8.18) can be
approximated from the output of the posterior simulator because

$$M^{-1}\sum_{m=1}^{M} p(\theta_{A(1)}^* \mid \theta_{A>(1)}^{(m)}, y^o, A) \xrightarrow{a.s.}
p(\theta_{A(1)}^* \mid y^o, A).$$

To approximate $p(\theta_{A(b)}^* \mid \theta_{A<(b)}^*, y^o, A)$, first execute the Gibbs sampler with
the parameters in the first $b - 1$ blocks fixed at
$\theta_{A(1)}^*, \ldots, \theta_{A(b-1)}^*$. This provides a sequence
$\{\theta_{A\ge(b)}^{b(m)}\}$ from the conditional posterior distribution. Then

$$M^{-1}\sum_{m=1}^{M} p(\theta_{A(b)}^* \mid \theta_{A<(b)}^*, \theta_{A>(b)}^{b(m)}, y^o, A)
\xrightarrow{a.s.} p(\theta_{A(b)}^* \mid \theta_{A<(b)}^*, y^o, A) \tag{8.19}$$

for blocks $b = 2, \ldots, B - 1$. The last term on the right side of (8.18) can be
evaluated directly and requires no simulation. These approximations are then used
in (8.18) and (8.17) to obtain the approximation to $p(y^o \mid A)$.

This procedure is generally more efficient the larger is $p(\theta_A^* \mid y^o, A)$, so it helps
to choose $\theta_A^*$ to be near the mode of the posterior density. Of course, we should
get the same result, up to numerical standard error, for any choice of $\theta_A^*$. This
property is also the basis of a test for accuracy and convergence of Gibbs sampling
algorithms proposed by Zellner and Min (1995).


Example 8.2.2 Marginal Likelihood Approximation in the Normal Linear 
Regression Model In using the Gibbs sampler in the normal linear regression 
model (Example 4.3.1) there are only two blocks. If we designate the draw for h to 
be the first block of the Gibbs sampler and the draw for $\beta$ to be the second block,
the additional computational burden is negligible. Since there are only two blocks, 
no auxiliary simulations are needed. 
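The two-block case can be sketched for an assumed model that is simpler than the regression of Example 4.3.1: $y_t \mid (\mu, h) \sim N(\mu, h^{-1})$ with independent priors $\mu \sim N(m_0, h_0^{-1})$ and $s_0^2 h \sim \chi^2(\nu_0)$. The data, hyperparameter values, burn-in length, and function names are hypothetical. As in the example, $h$ is the first block, so the first factor of (8.18) is averaged over posterior draws and the second is evaluated in closed form.

```python
import math
import random
from statistics import fmean

y = [0.5, 1.2, -0.3, 0.8, 1.0, 0.1, 1.5, 0.4]   # hypothetical data
T, sum_y = len(y), sum(y)
M0, H0 = 0.0, 1.0        # prior: mu ~ N(M0, 1/H0)
NU0, S0SQ = 4.0, 4.0     # prior: S0SQ * h ~ chi-square(NU0)

def ssr(mu):
    return sum((v - mu) ** 2 for v in y)

def log_gamma_pdf(h, shape, rate):
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(h) - rate * h)

def log_normal_pdf(x, mean, prec):
    return 0.5 * math.log(prec / (2.0 * math.pi)) - 0.5 * prec * (x - mean) ** 2

def log_prior(mu, h):
    return log_normal_pdf(mu, M0, H0) + log_gamma_pdf(h, NU0 / 2.0, S0SQ / 2.0)

def log_lik(mu, h):
    return 0.5 * T * math.log(h / (2.0 * math.pi)) - 0.5 * h * ssr(mu)

def gibbs(M, seed=0, burn=500):
    rng = random.Random(seed)
    mu, draws = sum_y / T, []
    for i in range(M + burn):
        rate = (S0SQ + ssr(mu)) / 2.0
        h = rng.gammavariate((NU0 + T) / 2.0, 1.0 / rate)     # block 1: h | mu, y
        prec = H0 + T * h
        mu = rng.gauss((H0 * M0 + h * sum_y) / prec, 1.0 / math.sqrt(prec))  # block 2
        if i >= burn:
            draws.append((mu, h))
    return draws

def chib_log_ml(draws, mu_star, h_star):
    # First factor of (8.18): average p(h* | mu, y) over posterior draws of mu
    ord_h = fmean([math.exp(log_gamma_pdf(h_star, (NU0 + T) / 2.0,
                                          (S0SQ + ssr(mu)) / 2.0))
                   for mu, _ in draws])
    # Last factor: p(mu* | h*, y) in closed form, no simulation needed
    prec = H0 + T * h_star
    log_ord_mu = log_normal_pdf(mu_star, (H0 * M0 + h_star * sum_y) / prec, prec)
    return log_prior(mu_star, h_star) + log_lik(mu_star, h_star) - math.log(ord_h) - log_ord_mu

draws = gibbs(20000)
print("log ML at (0.6, 1.8):", chib_log_ml(draws, 0.6, 1.8))
print("log ML at (0.8, 1.2):", chib_log_ml(draws, 0.8, 1.2))
```

The two evaluation points should give the same value up to numerical standard error, which is the property noted above in connection with Zellner and Min (1995).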


Example 8.2.3 Marginal Likelihood Approximation in the Normal Mixture
Linear Model Recall that the posterior density function in the normal mixture
linear model (Section 6.4.2) is multimodal, a feature that makes it ill-suited to
importance sampling, and can present problems for the density ratio marginal
likelihood approximation method described in Section 8.2.4. Let $\theta$ denote the
parameters $\gamma$, $h$, $\pi$ and $\mathbf{h}$. Then

$$p(y^o \mid A) = \frac{p(\theta^* \mid A)\, p(\tilde s^* \mid \theta^*, A)\, p(y^o \mid \tilde s^*, \theta^*, A)}
{p(\tilde s^* \mid y^o, A)\, p(\theta^* \mid \tilde s^*, y^o, A)}. \tag{8.20}$$

Each of the three terms in the numerator of the right side of this expression can
be evaluated analytically. The evaluation of terms in the denominator requires the
output of the original posterior simulator and two conditional posterior simulators.
The first term in the denominator

$$p(\tilde s^* \mid y^o, A) = \int_\Theta p(\tilde s^* \mid \theta, y^o, A)\, p(\theta \mid y^o, A)\, d\nu(\theta)$$

can be approximated by $M^{-1}\sum_{m=1}^{M} p(\tilde s^* \mid \theta^{(m)}, y^o, A)$; note that the
terms in this sum are available from the original posterior simulator. The last term
in the denominator of (8.20) can be expressed as the product of the following four
terms:

$$p(\pi^* \mid \tilde s^*, y^o, A) = p(\pi^* \mid \tilde s^*, A),$$
$$p(\gamma^* \mid \pi^*, \tilde s^*, y^o, A) = p(\gamma^* \mid \tilde s^*, y^o, A),$$
$$p(h^* \mid \gamma^*, \pi^*, \tilde s^*, y^o, A) = p(h^* \mid \gamma^*, \tilde s^*, y^o, A),$$
$$p(\mathbf{h}^* \mid h^*, \gamma^*, \pi^*, \tilde s^*, y^o, A) = p(\mathbf{h}^* \mid h^*, \gamma^*, \tilde s^*, y^o, A).$$

The first and last terms can be evaluated analytically. A second run of the
posterior simulator with $\tilde s = \tilde s^*$ and the block in $\pi$ omitted produces the
simulations $\gamma^{1(m)}$, $h^{1(m)}$, and $\mathbf{h}^{1(m)}$ $(m = 1, \ldots, M)$, and the approximation

$$M^{-1}\sum_{m=1}^{M} p(\gamma^* \mid h^{1(m)}, \mathbf{h}^{1(m)}, \tilde s^*, y^o, A)$$

of the second term above. A third run with $\tilde s = \tilde s^*$ and $\gamma = \gamma^*$ produces the
simulations $h^{2(m)}$ and $\mathbf{h}^{2(m)}$, and the approximation
$M^{-1}\sum_{m=1}^{M} p(h^* \mid \mathbf{h}^{2(m)}, \gamma^*, \tilde s^*, y^o, A)$ of the third term.


8.2.4 Density Ratio Marginal Likelihood Approximation 


The marginal likelihood is the integrated posterior density kernel in standard 
form (2.8) 


$$p(y^o \mid A) = \int_{\Theta_A} p(\theta_A \mid A)\, p(y^o \mid \theta_A, A)\, d\nu(\theta_A). \tag{8.21}$$


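Given the output of a posterior simulator, $\theta_A^{(m)} \sim p(\theta_A \mid y^o, A)$, and evaluations of
the prior density $p(\theta_A^{(m)} \mid A)$ and data density $p(y^o \mid \theta_A^{(m)}, A)$, we can use Theorem
8.1.1 to approximate (8.21):

$$M^{-1} \sum_{m=1}^{M} \frac{f(\theta_A^{(m)})}{p(\theta_A^{(m)} \mid A)\, p(y^o \mid \theta_A^{(m)}, A)} \xrightarrow{a.s.} p(y^o \mid A)^{-1}. \tag{8.22}$$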


In (8.22) the probability density $f(\cdot)$ can be constructed from the posterior simulator
output as described in Theorem 8.1.2. The density ratio marginal likelihood
approximation was proposed by Gelfand and Dey (1994), and the implementation
with $f(\cdot)$ constructed from the posterior simulator output is due to Geweke (1999).
The advantage of the method is that it is generic. It applies to any posterior sim- 
ulator, no matter what algorithm is used, and since it approximates the marginal 
likelihood, it can be applied in approximating Bayes factors for any two models 
that pertain to the same data $y^o$. If the evaluation $p(\theta_A^{(m)} \mid A)\, p(y^o \mid \theta_A^{(m)}, A)$ is
recorded in each iteration, along with $\theta_A^{(m)}$, the density ratio marginal likelihood
approximation can be computed after the posterior simulation has been completed.
Computation time is typically negligible. The method's limitation is that approximations can
be poor when $\theta_A$ is of very high dimension, as discussed following Theorem
8.1.2. This limitation applies to all the other known methods of approximating the
marginal likelihood as well. 

BACC uses the approximation (8.22) for many models, and permits the user to 
specify the truncation parameter œ of the density fẹ of Theorem 8.1.2, as well as 
alternative values of L/M in the tapering function (4.46) for the computation of 
numerical standard errors. 
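
To make (8.22) concrete, the following is a minimal sketch for a scalar parameter, not the BACC implementation. It assumes that $f$ is a normal density centered at the posterior sample mean, scaled by the posterior sample standard deviation, and truncated to a central region of probability `p`; the function name, the argument names, and the toy conjugate model used for checking are all hypothetical choices made for illustration.

```python
import numpy as np
from statistics import NormalDist


def log_marginal_density_ratio(theta, log_kernel, p=0.9):
    """Density ratio approximation of log p(y|A) for a scalar parameter.

    theta      : (M,) posterior draws
    log_kernel : (M,) log[prior density * data density] at each draw
    p          : probability content of the truncated normal f (assumed name)
    """
    mu, sd = theta.mean(), theta.std(ddof=1)
    z = (theta - mu) / sd
    cut = NormalDist().inv_cdf(0.5 + p / 2.0)      # P(|Z| <= cut) = p
    log_f = -0.5 * np.log(2.0 * np.pi) - np.log(sd) - 0.5 * z**2 - np.log(p)
    log_f = np.where(np.abs(z) <= cut, log_f, -np.inf)   # f = 0 outside the region
    log_ratio = log_f - log_kernel                 # log of f / (prior * likelihood)
    c = log_ratio.max()
    log_inverse_ml = c + np.log(np.mean(np.exp(log_ratio - c)))  # left side of (8.22)
    return -log_inverse_ml


# Toy check: y_i ~ N(theta, 1), theta ~ N(0, tau2), so the exact answer is known.
rng = np.random.default_rng(0)
n, tau2 = 50, 4.0
y = rng.normal(1.0, 1.0, size=n)
post_var = 1.0 / (n + 1.0 / tau2)
post_mean = post_var * y.sum()
theta = rng.normal(post_mean, np.sqrt(post_var), size=20000)   # exact posterior draws


def log_prior(t):
    return -0.5 * np.log(2.0 * np.pi * tau2) - 0.5 * t**2 / tau2


def log_lik(t):
    ss = np.sum(y**2) - 2.0 * t * np.sum(y) + n * t**2   # sum_i (y_i - t)^2
    return -0.5 * n * np.log(2.0 * np.pi) - 0.5 * ss


approx = log_marginal_density_ratio(theta, log_prior(theta) + log_lik(theta))
# Exact value from p(y) = prior * likelihood / posterior, evaluated at the posterior mean.
log_post_at_mean = -0.5 * np.log(2.0 * np.pi * post_var)
exact = log_prior(post_mean) + log_lik(post_mean) - log_post_at_mean
```

Working on the log scale avoids underflow in the kernel evaluations. In this toy model `approx` and `exact` should agree closely, with the discrepancy driven mainly by simulation noise in the fraction of draws that fall inside the truncation region.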

It is often the case that part of the integration in (8.21) can be carried out 
analytically, thereby greatly reducing the dimension of the space over which the 
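simulation approximation must be made. Let $\theta_A' = (\theta_{A(1)}', \theta_{A(2)}')$, where $\theta_{A(2)}$ is of
high dimension whereas the order of $\theta_{A(1)}$ is small. Suppose that

$$p(y^o \mid \theta_{A(1)}, A) = \int_{\Theta_{A(2)}} p(\theta_{A(2)} \mid \theta_{A(1)}, A)\, p(y^o \mid \theta_{A(1)}, \theta_{A(2)}, A)\, d\nu(\theta_{A(2)})$$

can be evaluated analytically. Then the whole marginal likelihood evaluation problem
may be cast in terms of $\theta_{A(1)}$ rather than the full vector $\theta_A$.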


Example 8.2.4 Evaluating the Marginal Likelihood in the Probit Model The
posterior simulator constructed in Section 6.2 has two blocks: $\beta$, the $k \times 1$ vector of
covariate coefficients, and $\tilde{y}$, the $T \times 1$ vector of latent probits. A direct application
of the density ratio method of marginal likelihood approximation would require
tailoring $f(\cdot)$ to approximate the distribution of the $(k + T) \times 1$ vector $(\beta', \tilde{y}')'$.
For large $T$ (applications with $T > 1000$ are common), this becomes impractical.
Instead, we may exploit the essentially closed-form representation

$$
\begin{aligned}
\int p(\tilde{y} \mid \beta, X, A)\, p(y^o \mid \tilde{y}, \beta, X, A)\, d\tilde{y}
&= \int p(\tilde{y} \mid \beta, X, A)\, p(y^o \mid \tilde{y}, A)\, d\tilde{y} \\
&= p(y^o \mid \beta, X, A) = \prod_{t=1}^{T} p(y_t^o \mid \beta, x_t, A) \\
&= \prod_{t=1}^{T} \Phi(-\beta' x_t)^{1 - y_t^o}\, \Phi(\beta' x_t)^{y_t^o},
\end{aligned} \tag{8.23}
$$

where $y_t^o = 0$ for the first outcome and $y_t^o = 1$ for the second. (The univariate
normal cdf $\Phi$ cannot be expressed in closed form but can be computed to machine
accuracy very rapidly.) We record the product of (8.23) and $p(\beta^{(m)} \mid A)$, along
with $\beta^{(m)}$, each iteration, and then apply the density ratio approximation method
over the k-dimensional space.
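
As an illustration of the quantity recorded at each iteration, the sketch below evaluates the logarithm of (8.23) for a given coefficient vector. It assumes the usual probit convention that the second outcome ($y_t^o = 1$) has probability $\Phi(\beta' x_t)$; the function names and the small data set are hypothetical, and the log prior density would be added separately to obtain the log kernel.

```python
import math
import numpy as np


def log_Phi(z):
    """Log of the standard normal cdf via erfc; adequate for moderate |z| only."""
    return math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))


def probit_log_likelihood(beta, X, y):
    """Log of (8.23): sum over t of y_t log Phi(b'x_t) + (1 - y_t) log Phi(-b'x_t)."""
    index = X @ beta                                  # T-vector of beta' x_t
    return sum(log_Phi(v) if yt == 1 else log_Phi(-v)
               for v, yt in zip(index, y))


# Hypothetical data: T = 5 observations, k = 2 covariates (constant and one regressor).
X = np.array([[1.0, -1.2], [1.0, 0.3], [1.0, 0.8], [1.0, -0.4], [1.0, 1.5]])
y = np.array([0, 1, 1, 0, 1])
loglik_at_zero = probit_log_likelihood(np.zeros(2), X, y)   # every factor equals 1/2
```

Because only the $k$ coefficients enter this function, the density $f(\cdot)$ in (8.22) needs to be tailored to a $k$-dimensional distribution rather than a $(k + T)$-dimensional one. For very large $|\beta' x_t|$ a more careful evaluation of the log cdf would be needed than the simple one used here.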


Exercise 8.2.1 Completing the Argument Derive (8.19).

Exercise 8.2.2 Bayes Factor with Common Likelihoods Example 5.1.1 employed
a conditionally conjugate prior distribution in which $\beta \sim N(\underline{\beta}, \underline{H}^{-1})$.
Beyond the original specification of the prior, the example considered a “weak”
prior that reduced $\underline{H}$ by a factor of 5, and a “strong” prior that increased it by a
factor of 5.


(a) Letting $A_1$ denote the model with the original prior and $A_2$ the model with
the strong prior, use the method of Section 8.2.1 to approximate the logarithm
of the Bayes factor in favor of the model with the strong prior. Find the
numerical standard error of approximation, and verify that the approximation
is consistent with the exact result reported in Example 5.1.1.

(b) Repeat part (a), but reverse the roles of $A_1$ and $A_2$. Use BACC to compute
the numerical standard error of the logarithm of the Bayes factor. Then,
increase the number of iterations used by a factor of 10, and again compute
the numerical standard error. Discuss.


Exercise 8.2.3 Marginal Likelihood Approximation Using Importance Sam- 
pling Consider the importance sampling algorithm for the stationary first-order 
Markov finite state model described in Section 7.2.2, and let w(P™) = 
Tj 7 j(P™)") denote the weight at iteration m. 


(a) Show that an almost sure limit $w$ of $M^{-1} \sum_{m=1}^{M} w(P^{(m)})$ exists. 

(b) Derive a closed-form expression for the marginal likelihood of the stationary 
first-order Markov finite state model, expressed in terms of w, the prior 
hyperparameters a;j, and the sufficient statistics nọ). 

Exercise 8.2.4 Marginal Likelihood Approximation Using Gibbs Sampling 
This exercise is an extension of Example 5.1.1. Use the method of Section 8.2.3 
to approximate the log marginal likelihood of the model in Example 5.1.1, and 
find the numerical standard error of this approximation. Verify that the result is 
consistent with the exact value reported in Example 5.1.1, and that its numerical 
standard error is smaller than the numerical standard error of .0030 for the density 
ratio approximation of the log marginal likelihood reported in that example. 


Exercise 8.2.5 Inference about Shape Exercise 5.4.3(b) found the posterior 
probability that the shape constraints were valid over the range of the data (11.4 
to 27), conditional on the regression function for the student : teacher ratio being 
polynomials of order 2, 3, 4 or 5. Consider, instead, models that constrain the 
function to have this shape by appropriate truncation of the prior distribution used 
in Example 5.4.3 and Exercise 5.4.3(b). Find the posterior odds ratio in favor of 
this model, versus the model of Example 5.4.3 and Exercise 5.4.3(b). Contrast the 
result with the posterior probabilities found in Exercise 5.4.3(b). (Hint: Consider 
simulating from the prior distribution of Example 5.4.3 in order to find the required 
normalizing constant for the prior density in the new model.) 

8.3 MODEL SPECIFICATION 


Creating a complete model entails specification of the prior density of unobservables

$$p(\theta_A \mid A), \tag{8.24}$$

the conditional density of observables

$$p(y \mid \theta_A, A), \tag{8.25}$$

and the composition of a vector of interest $\omega$ and its conditional density

$$p(\omega \mid \theta_A, y, A). \tag{8.26}$$


To this point we have taken as given all of these distributions. Yet a sensible specifi- 
cation of all three is essential to the investigator’s task of informing and improving 
decisions. To a substantial extent, success in this endeavor arises from the inves- 
tigator’s creativity, experience, skill, and understanding of the client’s situation. 
These characteristics cannot be endowed analytically. However, the investigator 
can employ certain systematic procedures that are very useful in the process 
of creating and improving models. These procedures are known collectively as 
Bayesian specification analysis. 

It is useful to distinguish between two kinds of specification analysis. Prior pre- 
dictive analysis takes place when the investigator is considering alternative variants 
of (8.24)—(8.26). It requires only forward simulation. Posterior predictive analysis 
takes place after the investigator has conditioned on the data $y^o$ and is considering
possible changes in the complete model. It requires a posterior simulator. 
We distinguish between prior and posterior specification analysis in part because 
the former is less costly than the latter; if a prior predictive specification analysis 
indicates serious problems, it may be prudent to consider alternatives to (8.24), 
(8.25), and (8.26), before undertaking full implementation of the model as out- 
lined in Section 8.1. There is a substantial literature on both of these approaches. 
The seminal work of Box (1980), and the comments of discussants published with 
that paper, still provide deep and useful perspectives on specification analysis. 
For a similar more recent symposium, see Bayarri and Berger (1999) and their 
discussants. 


8.3.1 Prior Predictive Analysis 


The objective of prior predictive analysis is to ascertain the prior distribution of
functions of the form $h(y, \omega)$ that are interesting and relevant to the problem. This
can be accomplished through the forward simulation

$$\theta_A^{(m)} \sim p(\theta_A \mid A), \tag{8.27}$$
$$y^{(m)} \sim p(y \mid \theta_A^{(m)}, A), \tag{8.28}$$
$$\omega^{(m)} \sim p(\omega \mid \theta_A^{(m)}, y^{(m)}, A), \tag{8.29}$$
$$h^{(m)} = h(y^{(m)}, \omega^{(m)}). \tag{8.30}$$


The forward simulation of $h$ may reveal deficiencies in the model, in the sense 
that the distribution of $h(y, \omega)$ coincides poorly with beliefs about this function. For 
instance, in the value at risk example, the function h might be the sample correlation 
coefficient between squared returns on successive days over a hypothetical sample 
of (say) 100 days’ duration. If the prior distribution of h places probability .99 on 
the interval (—.01, .01), most clients would regard the model as an unreliable basis 
on which to assess value at risk, given the well-documented persistence in squared 
returns on financial assets. They would hesitate to proceed with such a model, 
because it does not reflect their prior beliefs. More generally, prior predictive 
analysis interprets the model specification—both the distribution of observables 
conditional on parameters, and the prior distribution of parameters—in terms of 
observables and vectors of interest that are usually easier to understand than the 
parameters themselves. 
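
The forward simulation (8.27)-(8.30) can be sketched in a few lines. The example below is an illustration under stated assumptions, not a model from this book: returns are taken to be i.i.d. normal with a gamma prior on the precision (both hypothetical choices), the vector of interest is omitted, and $h$ is the first-order sample autocorrelation of squared returns over 100 days, in the spirit of the value at risk discussion above.

```python
import numpy as np


def lag1_autocorr(x):
    """First-order sample autocorrelation of the series x."""
    x = x - x.mean()
    return float(np.dot(x[1:], x[:-1]) / np.dot(x, x))


def prior_predictive(draw_theta, draw_y, h, M, rng):
    """Forward simulation (8.27), (8.28), (8.30); returns M draws of h(y)."""
    out = np.empty(M)
    for m in range(M):
        theta = draw_theta(rng)      # (8.27): parameters from the prior
        y = draw_y(theta, rng)       # (8.28): observables given parameters
        out[m] = h(y)                # (8.30): function of interest
    return out


rng = np.random.default_rng(1)
T = 100
draw_theta = lambda g: g.gamma(shape=2.0, scale=0.5)               # assumed prior on precision
draw_y = lambda prec, g: g.normal(0.0, 1.0 / np.sqrt(prec), size=T)
h = lambda y: lag1_autocorr(y**2)

h_draws = prior_predictive(draw_theta, draw_y, h, M=2000, rng=rng)
share_persistent = float(np.mean(h_draws > 0.3))   # prior probability of strong persistence
```

Under this deliberately simple specification the prior distribution of $h$ is centered near zero and assigns little probability to substantial persistence in squared returns, which is the kind of deficiency the analysis is meant to expose before any data are used.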

If the prior predictive analysis reveals deficiencies, then the investigator may 
revise (8.24), (8.25), or (8.26), or may decide to begin anew with a completely 
different specification. An investigator with strong insights into the substance of 
alternative models and the beliefs of clients may find that the prior distribution of 
h provides useful clues in revising the model or constructing a new one. 


Example 8.3.1 Prior Predictive Analysis in the Earnings Example Recall the 
distribution of earnings conditional on age and education studied in Example 6.4.1 
using the normal mixture linear model. In that model the disturbances
$\varepsilon_t = y_t - \beta' x_t$ are independent and identically distributed. Conditional on a latent state
$\tilde{s}_t = j$, $\varepsilon_t \sim N[\alpha_j, (h \cdot h_j)^{-1}]$ $(j = 1, \ldots, m)$. (Section 6.4.2 provides the complete
specification of the model.) The prior distribution in Example 6.4.1 had five independent
components: 


XB ~ N[10.57, (0.7T)I7], (8/3)h ~ x2(10), 
æ; ~ N(0,2.5), 3h; ~ x7), p {~ Dirichlet (1,..., 1). (8.31) 


Example 6.4.1 considered the cases of mixtures of m = 2 and m = 3 normal distri- 
butions. Since the Bayes factor favored the mixture of three normals, this example 
pertains to that case. 

The prior distribution (8.31) can be understood through its implications for 
observables y by means of various functions h(y). The observables we consider 
here are the log earnings in a population of T men with the same ages and levels 
of education as in the sample—that is, the T x 1 random vector y. Functions h(y) 
can be chosen to summarize various aspects of y. Thus the forward simulation 
(8.27)—(8.30) amounts to drawing a set of parameters from the prior distribution and 
then, using the T x k covariate matrix X, simulating values of the corresponding 
T x 1 log earnings vector y, and then finding the corresponding functions h(y). 
These functions h(y) can be chosen to study the implications of the prior distri- 
bution for the regression of log wages on age and education. One such function is 
the difference between the average log earnings of college graduates (e = 16) and 
high school graduates (e = 12) in a population of men defined by the covariates in 
the sample; of course, these men do not all have the same age, and the distribution 
of ages within each of the two groups tends to differ. Another is the difference 
between the average log earnings of high school graduates and men with 8 years of 
education. Turning to systematic differences by age, one function is the difference 
between the average log earnings of men age 45 and those age 30, while another is 
the difference between the average log earnings of men age 60 and those age 45. 

The prior distribution expresses the variety of conditional distributions of log 
earnings, as well as conditional means, that are plausible. We cannot observe these 
conditional distributions, but we can observe closely related sample counterparts. 
A sample counterpart of the population conditional distribution coefficient of skew- 
ness, for example, is the sample skewness coefficient of the least-squares residuals 
(LSRs) in the population. Construction of this function $h$ amounts to drawing parameters
from the prior distribution, and then for each drawing simulating the log 
earnings sample y, computing the least-squares residuals, and forming the sample 
skewness coefficient of the residuals. Similar exercises can be conducted for the 
coefficient of kurtosis and for the inequality measures G, P, and R defined in 
Example 6.4.1. 

The results of this prior predictive analysis can be summarized in several ways. 
One useful summary is the following table, which indicates the observed value 
h(y°), together with the fraction of 2000 draws from the prior distribution that 
were less than h(y°) for each function h. For each function h a centered 95% 
prior credible interval includes the observed value $h(y^o)$. In this sense, the prior 
distribution accommodates the characteristics of the data well: 


Function h                                          $h(y^o)$    $P[h(y) < h(y^o) \mid A]$
Average(y | e = 12) − average(y | e = 8)             0.610      0.587
Average(y | e = 16) − average(y | e = 12)            0.491      0.645
Average(y | a = 45) − average(y | a = 30)            0.459      0.579
Average(y | a = 60) − average(y | a = 45)           −0.242      0.478
Variance of LSRs                                     0.539      0.289
Coefficient of skewness, LSRs                       −0.974      0.061
Coefficient of kurtosis, LSRs                        6.220      0.904
Gini coefficient, LSRs                               0.346      0.186
Proportion below half median, LSRs                   0.148      0.309
Earnings fraction to highest decile, LSRs            0.266      0.223


The summary in the foregoing table proceeds one dimension at a time. We
can work two dimensions at a time, as illustrated in Figure 8.1. Panels (a) and (b)
indicate that the prior distribution of observed systematic differences in earnings by
age and education is extremely diffuse. They show that the polynomial specification
is sufficiently flexible to simultaneously capture the observed difference in the
earnings of college and high school graduates and the observed difference in the
earnings of high school graduates and those completing eighth grade, in the sample,
for example. What is important, here, as well as more generally, is that (8.24) and
(8.25) do not appear to systematically exclude any plausible combinations.

[Figure 8.1, four scatterplot panels: (a) Education 12 vs. 8 against Education 16 vs. 12;
(b) Age 45 vs. 30 against Age 60 vs. 45; (c) coefficient of skewness against coefficient of
kurtosis; (d) Gini coefficient against fraction below half median.]

Figure 8.1. In each panel the scatterplots show values of two functions of 200 independent draws from
the prior; the observed value is the intersection of the horizontal and vertical lines: (a, b) differences
in sample average log earnings; (c) moments from least-squares residuals; (d) inequality measures from
least-squares residuals.

The prior distribution of the coefficients of skewness and kurtosis in the least- 
squares residuals indicates that the model is flexible in some ways; for example, 
it permits both platykurtic and leptokurtic distributions, including some that are 
highly leptokurtic. However, it provides much less support at the observed values 
than it does at values that are closer to the moments of a normal distribution. Panel 
(d) of Figure 8.1 is based on the exponentiated least-squares regression of log 
wages on the polynomial in age and education. As the Gini coefficient approaches 
zero, the fraction of men with earnings below half the median must also approach 
zero, but as the Gini coefficient increases, the prior allows substantial flexibility 
in the distribution of earnings. The distribution is diffuse but appears to provide 
greatest support to points near the observed values. 


The preceding example illustrates how an investigator makes sense of a complex 
model, including the prior distribution (8.31). The components of the model have 
meaning taken together, but not in isolation. The objective of prior predictive 
analysis is to summarize the model by means of functions $h(y, \omega)$ to which the 
client can assign prior probability directly. [There is a substantial literature about 
prior elicitation that builds on this fact, e.g., Garthwaite and Dickey (1988), Kadane 
(1994), and Kadane and Wolfson (1998).] The investigator’s task is to create models 
(8.24) that reproduce the priors on $h(y, \omega)$ in as plausible a fashion as possible, 
given the choices of (8.24), (8.25), and (8.26) that are feasible. The next example 
applies this idea to the specification of the Markov normal mixture linear model in 
Example 7.3.2. 


Example 8.3.2 Prior Predictive Analysis in the S&P 500 Example Recall 
Example 7.3.2. In their application of the Markov normal mixture linear model 
to stock returns, Ryden et al. (1998) were motivated, in part, by the challenge of 
reproducing some stylized facts about observed returns on financial assets identified 
by Granger and Ding (1995a, 1995b). These characteristics include the following: 


1. Returns $y_t$ are not autocorrelated (except possibly at lag one).

2. The autocorrelation functions of $|y_t|$ and $y_t^2$ decay slowly. The decay is much
slower than the exponential rate of an autoregressive model, like the one
considered in Section 7.1.

3. The correlation of $|y_t|^\delta$ and $|y_{t-1}|^\delta$ is highest when $\delta = 1$.

4. Correlations between $\mathrm{sign}(y_t)$ and $\mathrm{sign}(y_{t-j})$ are negligibly small.

5. The correlation between $|y_t|$ and $\mathrm{sign}(y_t)$ is negligibly small.

6. $|y_t|$ has the same mean and standard deviation.


Note that these properties refer to observed returns, not their population coun- 
terparts. They indicate features that would be desirable in complete models of 
financial returns, and strongly suggest functions of observables that can be used 
to interpret and evaluate any complete model of financial returns before it is 
applied. The following table indicates the values of the sample statistics s° = h(y°) 
from the sample of Example 7.3.2 (T = 1700) and the corresponding point in the 
prior cdf of s = h(y) using the model and prior distribution described in that 
example. The evaluation of the prior cdf is based on 1000 draws from the prior 
distribution, followed by a sample of size T = 1700 conditional on the parameters 
drawn. 


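Property  Observable $s = h(y)$                                                      $s^o$               $P(s < s^o \mid A)$
1         $\mathrm{corr}(y_t, y_{t-1})$                                              0.1029              0.34
1         $\mathrm{corr}(y_t, y_{t-2})$                                              0.0338              0.17
2         $\mathrm{corr}(|y_t|, |y_{t-1}|)$                                          0.0531              0.20
2         $\mathrm{corr}(|y_t|, |y_{t-2}|)$                                          0.0104              0.07
2         $\mathrm{corr}(|y_t|, |y_{t-5}|)$                                          0.1107              0.95
2         $\mathrm{corr}(|y_t|, |y_{t-1}|)^2 / \mathrm{corr}(|y_t|, |y_{t-2}|)$      0.2716              0.69
2         $\mathrm{corr}(|y_t|, |y_{t-1}|)^5 / \mathrm{corr}(|y_t|, |y_{t-5}|)$      $4 \times 10^{-6}$  0.26
3         $\mathrm{corr}(y_t^2, y_{t-1}^2)$                                          0.0697              0.32
3         $\mathrm{corr}(y_t^2, y_{t-1}^2) / \mathrm{corr}(|y_t|, |y_{t-1}|)$        1.3134              0.93
4         $\mathrm{corr}[\mathrm{sign}(y_t), \mathrm{sign}(y_{t-1})]$                0.0728              0.23
4         $\mathrm{corr}[\mathrm{sign}(y_t), \mathrm{sign}(y_{t-2})]$                0.0031              0.05
5         $\mathrm{corr}[|y_t|, \mathrm{sign}(y_t)]$                                 0.0114              0.55
6         $\overline{|y|} = \sum_{t=1}^{T} |y_t| / T$                                0.0068              0.40
6         $\mathrm{SD}(|y_t|) = [\sum_{t=1}^{T} (|y_t| - \overline{|y|})^2 / (T-1)]^{1/2}$   0.0059      0.48
6         $\overline{|y|} / \mathrm{SD}(|y_t|)$                                      1.1589              0.28
          $[\sum_{t=1}^{T} (|y_t| - \overline{|y|})^3 / (T-1)] / \mathrm{SD}(|y_t|)^3$   0.227           0.70
          $[\sum_{t=1}^{T} (|y_t| - \overline{|y|})^4 / (T-1)] / \mathrm{SD}(|y_t|)^4$   4.4595          0.70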


The complete model accounts well for each statistic, taken in isolation. These 
results, by themselves, say nothing about the ability of the model to account simul- 
taneously for two or more of these statistics, or of the stylized facts identified by 
Granger and Ding (1995a, 1995b) and Ryden et al. (1998). That possibility can be 
investigated using the graphical methods of the previous example. 
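
Statistics of the kind tabulated above are simple to compute from any simulated or observed return series. The following sketch, with hypothetical function names and dictionary labels, evaluates a subset of them and the fraction of simulated values falling below an observed value; it is offered as an illustration and is not the code used to produce the table.

```python
import numpy as np


def corr_lag(a, b, lag):
    """Sample correlation between a_t and b_{t-lag} (lag >= 0)."""
    if lag == 0:
        return float(np.corrcoef(a, b)[0, 1])
    return float(np.corrcoef(a[lag:], b[:-lag])[0, 1])


def stylized_statistics(y):
    """A subset of the statistics s = h(y) associated with properties 1-6."""
    y = np.asarray(y, dtype=float)
    ay, sg = np.abs(y), np.sign(y)
    mean_abs, sd_abs = ay.mean(), ay.std(ddof=1)
    return {
        "corr(y_t, y_t-1)": corr_lag(y, y, 1),
        "corr(|y_t|, |y_t-1|)": corr_lag(ay, ay, 1),
        "corr(|y_t|, |y_t-5|)": corr_lag(ay, ay, 5),
        "corr(y_t^2, y_t-1^2)": corr_lag(y**2, y**2, 1),
        "corr(sign y_t, sign y_t-1)": corr_lag(sg, sg, 1),
        "corr(|y_t|, sign y_t)": corr_lag(ay, sg, 0),
        "mean|y| / SD|y|": float(mean_abs / sd_abs),
    }


def cdf_at_observed(s_draws, s_obs):
    """Fraction of simulated statistics that fall below the observed statistic."""
    return float(np.mean(np.asarray(s_draws) < s_obs))
```

Applying `stylized_statistics` to each simulated sample of size $T$ and then `cdf_at_observed` to the resulting draws, one statistic at a time, gives entries analogous to the last column of the table. Keeping the draws of all statistics together also makes it straightforward to examine pairs of statistics graphically.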


8.3.2 Posterior Predictive Analysis 


Consider the following conceptual experiment. We have an observed outcome $y^o$
from an experiment that can be repeated. Our vector of interest $\omega$ is observable outcomes
in independent repetitions of the same experiment. The observed outcome
can be summarized in any one of several ways, typically a scalar function $h(y^o)$,
and the outcomes of repetitions of the experiment can be summarized in the same
ways, typically $h(\omega)$. The predictive density for these outcomes in repetitions of
the experiment is $p[h(\omega) \mid y^o, A]$. The observed $h(y^o)$, in the context of this distribution,
tells us much about the model $A$. It may turn out that the observed $h(y^o)$
is implausible, in the sense that $h(y^o)$ is not an element of a $100(1 - \alpha)\%$ highest
posterior density credible set for $h(\omega)$ with respect to $p[h(\omega) \mid y^o, A]$. If the predictive
density of $h(\omega)$ is unimodal, this is equivalent to $h(y^o)$ being in the extreme
tails of $p[h(\omega) \mid y^o, A]$. This idea goes back to the notion of “surprise” discussed 
by Good (1956), and its essentials were further developed by Rubin (1984) in what 
he termed “model monitoring by posterior predictive checks.” This idea builds on 
that of the probability integral transform stressed by Dawid (1984) in prequential 
forecasting, and formalized by Meng (1994); see also the comprehensive survey of 
Gelman et al. (1996) and the more recent work of Bayarri and Berger (1999) and 
the references cited therein. 

This conceptual experiment is the motivation for posterior predictive analysis, 
whose mechanics are the same as those in (8.27)–(8.30), except that we replace 
(8.27) with $\theta_A^{(m)} \sim p(\theta_A \mid y^o, A)$. In many cases the posterior predictive analysis 
may parallel the prior predictive analysis, and in this case the posterior predic- 
tive analysis involves almost no additional effort given the output of the posterior 
simulator. 
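
A minimal sketch of these mechanics follows. The model is a hypothetical one chosen so that the posterior can be sampled exactly: observations are assumed normal with unit variance and a normal prior on the mean, while the observed data are generated from an exponential distribution so that the check has something to detect. The function $h$ is the sample coefficient of skewness.

```python
import numpy as np


def skewness(x):
    """Sample coefficient of skewness."""
    d = x - x.mean()
    return float(np.mean(d**3) / np.mean(d**2) ** 1.5)


rng = np.random.default_rng(2)
n = 200
y_obs = rng.exponential(1.0, size=n)      # data generated outside the assumed model

# Assumed model A: y_i ~ N(theta, 1), theta ~ N(0, 10^2); the posterior is normal.
post_var = 1.0 / (n + 1.0 / 100.0)
post_mean = post_var * y_obs.sum()

M = 2000
theta = rng.normal(post_mean, np.sqrt(post_var), size=M)     # replaces (8.27)
y_rep = rng.normal(theta[:, None], 1.0, size=(M, n))         # replicated data, as in (8.28)
h_rep = np.array([skewness(row) for row in y_rep])           # h evaluated on each replication

# Approximates P[h(omega) < h(y^o) | y^o, A].
pp_cdf = float(np.mean(h_rep < skewness(y_obs)))
```

Here `pp_cdf` is close to one: the observed skewness lies in the extreme upper tail of its posterior predictive distribution, signalling that the normal specification cannot reproduce this feature of the data. With a real posterior simulator, the draws `theta` would simply be read from its output.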


Example 8.3.3 Posterior Predictive Analysis in the Earnings Example A pos- 
terior predictive analysis of the earnings model of Example 6.4.1, using the same 
functions $h(\cdot)$ as in the prior predictive analysis of Example 8.3.1, yields the results 
shown below. The posterior cdf evaluated at the observed values is in the interval 
(0.05, 0.95) in every case. The four functions based on conditional means of earn- 
ings indicate observed values near the median of the posterior cdf in two cases, 
and in the upper quartile in two others. The six functions based on the least squares 
residuals, which indicate the ability of the model to capture the scale and shape of 
the conditional distribution, all yield observed values in the interquartile range of 
the posterior cdf. 


Function h                                          $h(y^o)$    $P[h(\omega) < h(y^o) \mid y^o, A]$
Average(y | e = 12) − average(y | e = 8)             0.610      0.940
Average(y | e = 16) − average(y | e = 12)            0.491      0.474
Average(y | a = 45) − average(y | a = 30)            0.459      0.889
Average(y | a = 60) − average(y | a = 45)           −0.242      0.552
Variance of LSRs                                     0.539      0.442
Coefficient of skewness, LSRs                       −0.974      0.619
Coefficient of kurtosis, LSRs                        6.220      0.275
Gini coefficient, LSRs                               0.346      0.519
Proportion below half-median, LSRs                   0.148      0.418
Earnings fraction to highest decile, LSRs            0.266      0.514


These findings are reinforced in the pairwise posterior predictive analysis in
Figure 8.2, parallel to the pairwise prior predictive analysis in Figure 8.1. The
scatterplots in panels (c) and (d) of Figure 8.2 indicate that the mixture of three
normals model accounts well for non-normality, as reflected in the coefficients of
skewness and kurtosis and measures of inequality in the least squares residuals.
This is consistent with the findings in the foregoing table. The scatterplot in panel
(a) reveals that the two returns to education functions are nearly uncorrelated in the
posterior distribution, and so this panel adds little to the first two lines of the foregoing
table. The scatterplot in panel (b) shows that the two returns to age functions are
negatively correlated in the posterior distribution. As a consequence, the posterior
distribution captures the difference in the returns over these two age ranges, but it
is less successful in capturing changes in log earnings between age 30 and age 60.

Figure 8.2. In each panel the scatterplots show values of two functions of 200 draws from the
posterior; the observed value is the intersection of the horizontal and vertical lines: (a, b) differences in
sample average log earnings; (c) moments from least-squares residuals; (d) inequality measures from
least-squares residuals.


This example illustrates how Bayesian specification analysis can be used to 
capture the implications of models for observables. In general, the goal of posterior 
specification analysis is to highlight inconsistencies between models and observed 
data, thus increasing our understanding of models and sowing the seeds for the 
development of better models and improved decisionmaking. 


Example 8.3.4 Posterior Predictive Analysis in the S&P 500 Example Repeat- 
ing the analysis of Example 8.3.2, using the posterior distribution in place of the 
prior distribution, produces the following results. They are based on the 20,000 
iterations of the posterior simulator described in Example 7.3.2. 


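Property  Observable $s = h(y)$                                                      $s^o$               $P(s < s^o \mid y^o, A)$
1         $\mathrm{corr}(y_t, y_{t-1})$                                              0.1029              0.964
1         $\mathrm{corr}(y_t, y_{t-2})$                                              0.0338              0.618
2         $\mathrm{corr}(|y_t|, |y_{t-1}|)$                                          0.0531              0.141
2         $\mathrm{corr}(|y_t|, |y_{t-2}|)$                                          0.0104              0.014
2         $\mathrm{corr}(|y_t|, |y_{t-5}|)$                                          0.1107              0.898
2         $\mathrm{corr}(|y_t|, |y_{t-1}|)^2 / \mathrm{corr}(|y_t|, |y_{t-2}|)$      0.2716              0.939
2         $\mathrm{corr}(|y_t|, |y_{t-1}|)^5 / \mathrm{corr}(|y_t|, |y_{t-5}|)$      $4 \times 10^{-6}$  0.107
3         $\mathrm{corr}(y_t^2, y_{t-1}^2)$                                          0.0697              0.223
3         $\mathrm{corr}(y_t^2, y_{t-1}^2) / \mathrm{corr}(|y_t|, |y_{t-1}|)$        1.3134              0.779
4         $\mathrm{corr}[\mathrm{sign}(y_t), \mathrm{sign}(y_{t-1})]$                0.0728              0.985
4         $\mathrm{corr}[\mathrm{sign}(y_t), \mathrm{sign}(y_{t-2})]$                0.0031              0.392
5         $\mathrm{corr}[|y_t|, \mathrm{sign}(y_t)]$                                 0.0114              0.670
6         $\overline{|y|} = \sum_{t=1}^{T} |y_t| / T$                                0.0068              0.357
6         $\mathrm{SD}(|y_t|) = [\sum_{t=1}^{T} (|y_t| - \overline{|y|})^2 / (T-1)]^{1/2}$   0.0059      0.627
6         $\overline{|y|} / \mathrm{SD}(|y_t|)$                                      1.1589              0.071
          $[\sum_{t=1}^{T} (|y_t| - \overline{|y|})^3 / (T-1)] / \mathrm{SD}(|y_t|)^3$   0.227           0.781
          $[\sum_{t=1}^{T} (|y_t| - \overline{|y|})^4 / (T-1)] / \mathrm{SD}(|y_t|)^4$   4.4595          0.844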


The posterior distribution accounts reasonably well for the properties of interest 
identified in Example 8.3.2, at least taken one at a time. The most difficult features 
appear to be the autocorrelation functions of |y,| and sign(y,). Exercise 8.3.3 takes 
up these and some other possibilities for posterior predictive analysis. 


Exercise 8.3.1 Predictive Analyses Suppose that $p[h(y) = h(y^o) \mid y^o, A] = 0$. 


(a) At what point in a prior predictive analysis would an investigator become 
aware of this fact? 


(b) At what point in a posterior predictive analysis would an investigator become 
aware of this fact? 

Exercise 8.3.2 Predictive Analyses of the Normal Linear Model for Earnings 
Repeat the analysis of Examples 8.3.1 and 8.3.3, using a normal distribution in 
place of the mixture of normals distribution. In how many of the functions h(y) 
do you find evidence of serious misspecification? 


Exercise 8.3.3 Further Predictive Analyses of the S&P 500 Returns Model 
This exercise entails a more extensive specification analysis of the Markov normal 
mixture model for the S&P 500 returns. 


(a) The analysis in Example 8.3.4 suggested potential difficulties in the distribution
of sample autocorrelations of the absolute returns and signs of
returns. By modifying the code in the online appendix, use graphical methods
to undertake prior and posterior predictive analyses appropriate to each
of these features of observed returns.

(b) This model was motivated by the value at risk problem first introduced in
Section 1.1.2. Extend the prior and posterior predictive analysis of Examples
8.3.2 and 8.3.4, using a vector of interest $\omega$ suggested by this problem.


Exercise 8.3.4 Predictive Analyses in the Class Size Examples The original 
work with the Massachusetts test score data (Example 5.1.1) assumed a normal 
linear regression. 


(a) Examples 5.4.1–5.4.4 investigated alternative models in which the regression
function is nonlinear in the covariates. Before proceeding to this step,
we might wish to perform a posterior predictive analysis, in the context of
Example 5.1.1, that would be sensitive to nonlinearity. Carry out such an
analysis, using functions h(y) that are sensitive to nonlinearity.

(b) All of Examples 5.1.1 and 5.4.1–5.4.4 assumed that the distribution of the
regression disturbances is normal. Design and execute a posterior predictive
analysis that provides information on whether this aspect of the model
specification appears to be reasonable.

8.4 BAYESIAN COMMUNICATION 


An investigator cannot anticipate the uses to which her work will be put, or the 
variants on her model that may interest a client. Different uses call for different 
vectors of interest $\omega$. Variants on her model will often revolve around changes in 
the prior distribution. Yet any investigator who has publicly reported results has 
confronted the constraint that only a few representative findings can be conveyed 
in written work. 

Posterior simulators provide a clear answer to the question of what the investigator
should report, and in the process remove the constraint that only a few representative
findings can be communicated. The investigator can provide electronic
access to the following $M \times (k + 3)$ simulator output matrix:

$$
\begin{bmatrix}
w_*(\theta_A^{(1)}) & p(\theta_A^{(1)} \mid A) & p(y^o \mid \theta_A^{(1)}, A) & \theta_A^{(1)\prime} \\
\vdots & \vdots & \vdots & \vdots \\
w_*(\theta_A^{(M)}) & p(\theta_A^{(M)} \mid A) & p(y^o \mid \theta_A^{(M)}, A) & \theta_A^{(M)\prime}
\end{bmatrix}
$$


Even very large simulator output matrices can be stored at negligible cost and 
communicated quickly over the Internet. 

Given the simulator output matrix, the client can immediately compute approx- 
imations to posterior moments not reported or even considered by the investigator, 
using spreadsheet arithmetic. For example, a client reading a research report might 
be skeptical that the investigator’s model, prior, and data set provide much informa- 
tion about the effects of an interesting change in a policy variable on the outcome 
in question. In many cases the client may be able to answer such questions in less 
time than required to read the paper. 

With a small amount of additional effort, the client can modify many of the 
investigator’s assumptions. Suppose that the client wishes to evaluate
$E[h(\omega) \mid y^o, A_2]$ using the same likelihood function as the investigator, but using his own
prior density $p(\theta_A \mid A_2)$ in place of the investigator’s prior density $p(\theta_A \mid A_1)$.
Suppose further that the support of the investigator’s prior distribution includes
the support of the client’s prior. Then the investigator’s posterior distribution may
be regarded as an importance sampling distribution for the client’s posterior. The
client reweights the investigator’s posterior simulation, using the function

$$w(\theta_A) = \frac{p(\theta_A \mid A_2)}{p(\theta_A \mid A_1)} \propto \frac{p(\theta_A \mid y^o, A_2)}{p(\theta_A \mid y^o, A_1)}. \tag{8.32}$$

The client thus approximates the posterior moment $E[h(\omega) \mid y^o, A_2]$ by

$$\bar{h}^{(M)} = \frac{\sum_{m=1}^{M} w_*(\theta_A^{(m)})\, w(\theta_A^{(m)})\, h(\omega^{(m)})}{\sum_{m=1}^{M} w_*(\theta_A^{(m)})\, w(\theta_A^{(m)})}. \tag{8.33}$$

In (8.33) $\omega^{(m)}$ is drawn from the conditional density $p(\omega \mid y^o, \theta_A^{(m)}, A_2)$ independently
of any other draws, $w_*(\theta_A)$ is any weighting that accompanies the posterior
simulation provided by the investigator, and $w(\cdot)$ is the function in (8.32). If the
investigator has employed an importance sampling algorithm, then Theorem 4.2.2
implies $\bar{h}^{(M)} \xrightarrow{a.s.} E[h(\omega) \mid y^o, A_2] = \bar{h}_{A_2}$. The same result holds for MCMC algorithms,
as formalized in the following result.


Theorem 8.4.1 Simulation-Consistent Approximation of Posterior Moments
with a Reweighted Posterior Distribution Suppose that two models, $A_1$ and $A_2$,
share the same parameter vector $\theta_A$ and observables density $p(y \mid \theta_A)$, but have
respective prior densities $p(\theta_A \mid A_1)$ with $\theta_A \in \Theta_{A_1}$, and $p(\theta_A \mid A_2)$ with $\theta_A \in \Theta_{A_2}$.
Suppose $\Theta_{A_2} \subseteq \Theta_{A_1}$ and define

$$
w(\theta_A) = p(\theta_A \mid A_2)/p(\theta_A \mid A_1).
$$

Let $\{\theta_A^{(m)}\}$ be an ergodic Markov chain with invariant density $p(\theta_A \mid y^o, A_1)$.
Suppose further that the distribution $\omega \mid (y^o, \theta_A, A)$ is the same in the two models, and
that $\omega^{(m)}$ is drawn independently from $p(\omega \mid y^o, \theta_A^{(m)}, A)$. If $\bar{h} = E[h(\omega) \mid y^o, A_2]$
exists, then

$$
\bar{h}^{(M)} = \frac{\sum_{m=1}^{M} w(\theta_A^{(m)})\, h(\omega^{(m)})}{\sum_{m=1}^{M} w(\theta_A^{(m)})} \xrightarrow{a.s.} \bar{h}.
$$

Proof: From Theorem 4.5.2, $\{\theta_A^{(m)}, \omega^{(m)}\}$ is ergodic with invariant density

$$
p(\theta_A \mid y^o, A_1)\, p(\omega \mid y^o, \theta_A, A).
$$

Since, as shown in Section 8.2.1,

$$
E[w(\theta_A) \mid y^o, A_1] = p(y^o \mid A_2)/p(y^o \mid A_1)
$$

and, by the same argument,

$$
E[w(\theta_A) h(\omega) \mid y^o, A_1] = [p(y^o \mid A_2)/p(y^o \mid A_1)]\, \bar{h}, \tag{8.34}
$$

the result follows. ∎


Note that the prior densities in $w(\theta_A) = p(\theta_A \mid A_2)/p(\theta_A \mid A_1)$ can be replaced
with arbitrary kernels in this result.
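
The reweighting in (8.33) and Theorem 8.4.1 can be sketched in a few lines. What follows is a minimal illustration, not BACC code: the function names `log_prior_ratio_normal` and `reweighted_moment` are hypothetical, the two priors are assumed normal with a common mean, and i.i.d. draws from a known conjugate posterior stand in for the investigator's MCMC output. Because the estimator is self-normalized, prior kernels suffice, as the note above indicates.

```python
import numpy as np


def log_prior_ratio_normal(beta, prior_mean, H1, H2):
    """Log of w(beta) = p(beta | A2) / p(beta | A1), up to a constant.

    Assumes both priors are normal with the same mean; H1 is the
    investigator's prior precision matrix and H2 the client's.
    beta has shape (M, k): one row per posterior draw.
    """
    d = np.asarray(beta, dtype=float) - prior_mean
    return -0.5 * np.einsum("mi,ij,mj->m", d, H2 - H1, d)


def reweighted_moment(h_draws, log_w, log_w_star=None):
    """Self-normalized weighted average of h, as in (8.33).

    log_w_star holds any log weights that already accompany the
    investigator's simulation; omit it for unweighted MCMC output.
    """
    h = np.asarray(h_draws, dtype=float)
    lw = np.asarray(log_w, dtype=float)
    if log_w_star is not None:
        lw = lw + np.asarray(log_w_star, dtype=float)
    w = np.exp(lw - lw.max())  # subtract the max to avoid overflow
    return float(np.sum(w * h) / np.sum(w))


# Toy check with one observation y ~ N(theta, 1) and prior theta ~ N(0, 1/H).
# The posterior is N(y / (1 + H), 1 / (1 + H)), so the answer is known.
rng = np.random.default_rng(0)
M = 200_000
y = 1.0
H1 = np.array([[0.2]])  # investigator: more diffuse prior
H2 = np.array([[1.0]])  # client: precision five times larger
post_prec_1 = 1.0 + H1[0, 0]
theta = rng.normal(y / post_prec_1, post_prec_1 ** -0.5, size=(M, 1))

log_w = log_prior_ratio_normal(theta, np.zeros(1), H1, H2)
estimate = reweighted_moment(theta[:, 0], log_w)
exact = y / (1.0 + H2[0, 0])  # client's exact posterior mean, 0.5
print(estimate, exact)
```

Working with log weights and subtracting the maximum before exponentiating is a numerical convenience only; it leaves the ratio in (8.33) unchanged.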

In Theorem 4.7.1 uniform ergodicity was one of the sufficient conditions for a 
central limit theorem. If the investigator’s algorithm produces uniformly ergodic 
$\{\theta_A^{(m)}\}$, and if the ratio of the client’s prior density to the investigator’s prior density
is bounded, then there is a central limit theorem under the client’s prior as well, 
as long as the client’s function of interest has finite posterior variance using this 
prior. This condition is strikingly similar to the sufficient condition discussed below 
Theorem 4.2.2. This is not surprising; the client is using the investigator’s posterior 
as his importance sampling distribution. 


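Theorem 8.4.2 A Central Limit Theorem for Weighted MCMC Simulators
Given the notation and assumptions of Theorem 8.4.1, suppose also that $\{\theta_A^{(m)}\}$ is
uniformly ergodic, that $\mathrm{var}[h(\omega) \mid y^o, A_2]$ exists and is finite, and $w(\theta_A) < W < \infty$
$\forall\, \theta_A \in \Theta_{A_2}$. Then there exists $\tau > 0$ such that

$$
M^{1/2}\left(\bar{h}^{(M)} - \bar{h}\right) \xrightarrow{d} N(0, \tau^2).
$$

Proof: Arguing exactly as in the proof of Theorem 4.5.2, uniform ergodicity
of $\{\theta_A^{(m)}\}$ implies uniform ergodicity of $\{\theta_A^{(m)}, \omega^{(m)}\}$. Let $\lambda_1$ and $\lambda_2$ be any
two real numbers, not both 0. Define $q_\lambda(\theta_A, \omega) = \lambda_1 w(\theta_A) + \lambda_2 w(\theta_A) h(\omega)$, and
observe that $E[q_\lambda(\theta_A, \omega) \mid y^o, A_1] = \lambda_1 \bar{w} + \lambda_2 \bar{w}\bar{h} = \bar{q}_\lambda$ exists and is finite, where
$\bar{w} = E[w(\theta_A) \mid y^o, A_1]$. From Theorem 4.7.1, we obtain

$$
M^{1/2}\left[M^{-1}\sum_{m=1}^{M} q_\lambda(\theta_A^{(m)}, \omega^{(m)}) - \bar{q}_\lambda\right] \xrightarrow{d} N(0, \tau_\lambda^2).
$$

Since this is true for all $\lambda \neq 0$, it follows [see Rao (1965), Theorem 2c.5(iv)] that

$$
M^{1/2}
\begin{bmatrix}
M^{-1}\sum_{m=1}^{M} w(\theta_A^{(m)})\, h(\omega^{(m)}) - \bar{w}\bar{h} \\
M^{-1}\sum_{m=1}^{M} w(\theta_A^{(m)}) - \bar{w}
\end{bmatrix}
\xrightarrow{d} N(0, V).
$$

A standard application of the delta method, as in the proof of Theorem 4.2.2, yields
the result. ∎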


Approximating the numerical accuracy of $\bar{h}^{(M)}$ is especially important when the
reweighting method of Theorem 8.4.1 is used. An ill-behaved weighting function
$w(\theta_A)$ will lead to poor approximation of $\bar{h}$. If this is the case, it is important
that the client be aware of this fact. This problem can be approached in a fashion
similar to the methods employed in Theorems 4.7.1 and 4.7.3.


Theorem 8.4.3 Variance of the Approximation Error in Weighted MCMC Simulators
In addition to the assumptions of Theorem 8.4.2, suppose that

$$
z_m = \left[w(\theta_A^{(m)}),\; w(\theta_A^{(m)})\, h(\omega^{(m)})\right]'
$$

is a stationary process with autocovariance function $C_j = \mathrm{cov}(z_m, z_{m+j})$ and spectral
density function $S_z(\lambda) = \sum_{j=-\infty}^{\infty} C_j \exp(-i\lambda j)$, and that the eigenvalues of
$S_z(\lambda)$ are bounded uniformly both above and away from zero. Then

$$
\tau^2 = r^{-2}\left[S_{22}(0) - 2\bar{h}\, S_{12}(0) + \bar{h}^2 S_{11}(0)\right],
$$

where $r = p(y^o \mid A_2)/p(y^o \mid A_1)$ and $S_{ij}(0)$ denotes element $(i, j)$ of $S_z(0)$.

Proof: Straightforward application of the delta method. ∎


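Corollary 8.4.1 Numerical Standard Errors for Weighted MCMC Simulators
In addition to the assumptions of Theorem 8.4.3, suppose also that $S_z^{(M)}(0) \xrightarrow{a.s.} S_z(0)$.
Then

$$
\hat{\tau}^{2(M)} = \left(\bar{w}^{(M)}\right)^{-2}\left[S_{22}^{(M)}(0) - 2\bar{h}^{(M)} S_{12}^{(M)}(0) + \left(\bar{h}^{(M)}\right)^2 S_{11}^{(M)}(0)\right] \xrightarrow{a.s.} \tau^2,
$$

where $\bar{w}^{(M)} = M^{-1}\sum_{m=1}^{M} w(\theta_A^{(m)})$.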
The numerical standard error of $\bar{h}^{(M)}$ is then $(\hat{\tau}^{2(M)}/M)^{1/2}$, whose computation
is guided by the following result.


Theorem 8.4.4 Simulation Consistency of Numerical Standard Errors for
Weighted MCMC Simulators Define $\{z_m\}$ as in Theorem 8.4.3. Let
$\bar{z}^{(M)} = M^{-1}\sum_{m=1}^{M} z_m$, and

$$
C_j^{(M)} = M^{-1}\sum_{m=j+1}^{M}\left(z_m - \bar{z}^{(M)}\right)\left(z_{m-j} - \bar{z}^{(M)}\right)' \quad (j = 0, 1, \ldots).
$$

Let $L(M)$ satisfy the same conditions as in Theorem 4.7.3. Then

$$
S_z^{(M)}(0) = C_0^{(M)} + \sum_{s=1}^{L-1}\left[(L - s)/L\right]\left(C_s^{(M)} + C_s^{(M)\prime}\right) \xrightarrow{a.s.} S_z(0).
$$

Proof: See Newey and West (1987). ∎


BACC permits arbitrary weighting in the computation of posterior moments. 
Of course, this may also be accomplished with standard mathematical appli- 
cations software or even spreadsheet arithmetic. However, BACC also utilizes 
Theorems 8.4.2—8.4.4 to provide numerical standard errors, using either default 
values of the tapering function L(M) of Theorem 8.4.4, or values of L/M specified 
by the user. 
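
The computation described by Corollary 8.4.1 and Theorem 8.4.4 can be sketched as follows. This is a hedged illustration rather than BACC's implementation: the function name `weighted_nse` and the choice of the lag truncation `L` passed by the caller are assumptions, and the check at the end uses i.i.d. draws with unit weights, for which the numerical standard error should be close to $M^{-1/2}$.

```python
import numpy as np


def weighted_nse(h_draws, w, L):
    """Reweighted posterior mean and its numerical standard error.

    Builds z_m = [w_m, w_m * h_m]', estimates S_z(0) with the tapered
    sum of autocovariances in Theorem 8.4.4, and applies the
    delta-method formula of Corollary 8.4.1.
    """
    h = np.asarray(h_draws, dtype=float)
    w = np.asarray(w, dtype=float)
    M = h.size
    z = np.column_stack([w, w * h])      # index 0: w, index 1: w * h
    zbar = z.mean(axis=0)
    d = z - zbar

    S = d.T @ d / M                      # C_0
    for s in range(1, L):
        # C_s = M^{-1} * sum over m of (z_m - zbar)(z_{m-s} - zbar)'
        C = d[s:].T @ d[:-s] / M
        S += (L - s) / L * (C + C.T)

    wbar = zbar[0]
    hbar = zbar[1] / zbar[0]
    tau2 = (S[1, 1] - 2.0 * hbar * S[0, 1] + hbar ** 2 * S[0, 0]) / wbar ** 2
    return float(hbar), float(np.sqrt(tau2 / M))


# Sanity check: i.i.d. N(0, 1) draws with unit weights.
rng = np.random.default_rng(0)
M = 20_000
h = rng.normal(size=M)
hbar, nse = weighted_nse(h, np.ones(M), L=10)
print(hbar, nse, M ** -0.5)
```

With serially correlated output or uneven weights the reported standard error grows relative to $M^{-1/2}$, which is the signal the client needs about an ill-behaved weighting function.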


Example 8.4.1 Bayesian Communication in the Class Size Example (The on- 
line appendix contains data, annotated code, and output for this example.) To illus- 
trate the process of Bayesian communication about priors and posterior moments 
outlined in this section, return to the normal linear model of Example 5.1.1. Sup- 
pose that the client’s prior is the original prior distribution set out in that example. 
The investigator, however, has used a prior distribution of the coefficient vector $\beta$
with prior precision $H$ that is 5 times smaller than the precision in the client’s prior
distribution. Using the algorithm of Example 4.3.1 with M = 100,000 iterations, 
the investigator, if asked, would report prior and posterior moments as follows: 


Investigator’s Model: Prior and Posterior Moments, Numerical Accuracy

                          Prior                 Posterior
Coefficient            Mean      SD        Mean       SD       NSE      RNE
intercept               710    268.84     682.42    10.52     .0329    1.02
str coefficient           0    6.7447    -0.6853    0.2644    .00085   1.07
el coefficient            0    7.1414    -0.4098    0.2800    .00079   1.25
lunch coefficient         0    1.7321    -0.5190    0.0678    .00020   1.12
income coefficient        0    75.620     16.517    2.955     .0091    1.04
In these results “NSE” is the numerical standard error computed according to
Theorem 4.7.3, with L/M = 0.02, and “RNE” is the relative numerical efficiency 
of the approximation: the ratio of the posterior variance to the product of M and 
the squared numerical standard error. If RNE is close to 1, then the efficiency 
of the MCMC algorithm is close to that of a hypothetical simulator making i.i.d. 
drawings from the posterior distribution itself. That is the case here because there 
is very little serial correlation in the algorithm of Example 4.3.1. 
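
As a small worked illustration of the RNE definition just given, the following sketch recomputes the entry for the intercept from the other columns of the investigator's table. The function name is hypothetical, and the inputs are the rounded table values, so agreement is only to the precision reported.

```python
def relative_numerical_efficiency(posterior_sd, nse, M):
    """RNE = posterior variance / (M * NSE^2)."""
    return posterior_sd ** 2 / (M * nse ** 2)


# Intercept row of the investigator's table: SD 10.52, NSE .0329, M = 100,000.
rne_intercept = relative_numerical_efficiency(10.52, 0.0329, 100_000)
print(round(rne_intercept, 2))  # 1.02, matching the tabulated RNE
```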

If the client were to execute the algorithm of Example 4.3.1 with his prior—that 
is, to repeat the exercise reported in Example 5.1.1—then the client, if asked, would 
report prior and posterior moments as follows: 


Client’s Model: Prior and Posterior Moments, Numerical Accuracy

                          Prior                 Posterior
Coefficient            Mean      SD        Mean       SD       NSE      RNE
intercept               710    120.23     682.61    10.49     .0310    1.15
str coefficient           0    3.1063    -0.6757    0.2635    .00081   1.07
el coefficient            0    3.1937    -0.3992    0.2783    .00086   1.04
lunch coefficient         0    0.7746    -0.5098    0.0675    .00020   1.12
income coefficient        0    33.818     16.415    2.956     .0097    0.93


The posterior moments are not exactly the same as those in Example 5.1.1, because 
the client has used a different seed for the random-number generator. Alternatively, 
the client might choose to reweight the simulations of the investigator. Denoting the
investigator’s prior precision matrix by $H_1$, the client’s prior precision matrix
by $H_2$, and the prior mean common to both by $\underline{\beta}$, the client will employ the
weighting function

$$
w(\theta_A) = w(\beta) = \exp\left[-(\beta - \underline{\beta})'(H_2 - H_1)(\beta - \underline{\beta})/2\right].
$$

Then the client, if asked, would report prior and posterior moments as follows: 


Reweighted Model: Prior and Posterior Moments, Numerical Accuracy;
Comparison

                          Prior                 Posterior
Coefficient            Mean      SD        Mean       SD       NSE      RNE      p
intercept               710    120.23     682.57    10.50     .0382    0.76    0.383
str coefficient           0    3.1063    -0.6751    0.2638    .00092   0.96    0.663
el coefficient            0    3.1937    -0.4009    0.2795    .00090   0.96    0.192
lunch coefficient         0    0.7746    -0.5098    0.0677    .00022   0.94    0.958
income coefficient        0    33.818     16.429    2.949     .0106    0.78    0.328
The rightmost column reports the outcome of a formal test that the posterior mean
when the investigator’s simulations are reweighted is the same as the posterior mean 
in the client’s simulation, for each of the five coefficients. We expect the p values 
to be uniformly distributed in the unit interval, and these results are consistent 
with that hypothesis. There is a loss in efficiency in reweighting the investigator’s 
simulations, apparent in RNE values that are somewhat (but not much) lower than 
those that result if the client carries out the simulation directly. 
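
One plausible form of the comparison in the rightmost column can be sketched as follows. The text does not spell out the test, so this is an assumption: a two-sided normal test of equal means that treats the two simulations as independent and uses the reported numerical standard errors. The function name is hypothetical. Because the table entries are rounded, the result for the intercept (roughly 0.4) is close to, but not the same as, the tabulated 0.383.

```python
import math


def equal_means_p_value(mean_a, nse_a, mean_b, nse_b):
    """Two-sided p-value for equality of two simulated posterior means.

    Assumes independent simulations and asymptotic normality, so the
    difference has standard error sqrt(nse_a^2 + nse_b^2).
    """
    z = (mean_a - mean_b) / math.hypot(nse_a, nse_b)
    return math.erfc(abs(z) / math.sqrt(2.0))


# Intercept: reweighted (682.57, .0382) versus client's own run (682.61, .0310).
p_intercept = equal_means_p_value(682.57, 0.0382, 682.61, 0.0310)
print(p_intercept)
```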


Exercise 8.4.1 Completing the Argument Derive (8.34). 


Exercise 8.4.2 An informal examination of the three sets of moments reported 
in Example 8.4.1 might suggest that although the moments in the reweighted model 
are quite similar to those in the client’s model, they are not that different from those 
in the investigator’s model, either. Test this conjecture formally. 


Exercise 8.4.3 Recall that both the “weak” and “strong” priors in Example 5.1.1 
produced marginal likelihood values that were lower than that of the original prior. 


(a) If the investigator were to steadily decrease the precision of her prior, in 
Example 8.4.1, the corresponding marginal likelihood would also decrease. 
Pick an even lower precision than the investigator used in Example 8.4.1, 
and reweight the output of the investigator’s simulator to obtain posterior 
moments of the coefficients under the client’s posterior. Verify that the qual- 
ity of the approximation of the client’s posterior moments is essentially as 
effective as it was in Example 8.4.1. 

(b) Using the investigator’s posterior from Example 8.4.1, reweight to obtain 
posterior moments under the “strong” prior of Example 5.1.1. Examine the 
relative numerical efficiency of the approximation. 

(c) Repeat part (b), but substitute the original prior of Example 5.1.1 for the 
investigator’s prior. Examine the relative numerical efficiency, and explain 
why it is not much higher than that in part (b). 


8.5 DENSITY RATIO ROBUSTNESS BOUNDS 


An alternative to providing the posterior simulation matrix is to report a range
of posterior moments, corresponding to all possible prior distributions within a
specified class of distributions. One such class is the density ratio class, discussed
in Section 3.3. It consists of all prior density kernels $k(\theta_A \mid A)$ bounded below and
above by kernels $a(\theta_A)$ and $b(\theta_A)$, respectively. Corresponding to each posterior
moment of interest, $E[g(\theta_A) \mid y^o, A]$, the investigator reports the smallest and largest
possible values over all prior density kernels satisfying
$0 < a(\theta_A) \le k(\theta_A \mid A) \le b(\theta_A)\ \forall\, \theta_A \in \Theta_A$.

It turns out that it is easy to approximate these bounds using the output of a 
posterior simulator. The details of the simulation algorithm are unimportant; the 
method is the same whether the simulator utilizes a variant of MCMC, importance 


278 BAYESIAN INVESTIGATION 


sampling, or any other method that provides a weighted or unweighted sample from 
the posterior distribution. This is a substantial practical attraction of the density 
ratio class, because the same procedures and software can be used for all posterior 
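simulation methods. It is necessary only to know that $E[g(\theta_A) \mid y^o, A]$ exists, and
that for simulator output $\{\theta_A^{(m)}\}$, function of interest $g(\theta_A)$ and known weighting
function $w(\theta_A)$

$$
\frac{\sum_{m=1}^{M} w(\theta_A^{(m)})\, p(y^o \mid \theta_A^{(m)}, A)\, k(\theta_A^{(m)} \mid A)\, g(\theta_A^{(m)})}{\sum_{m=1}^{M} w(\theta_A^{(m)})\, p(y^o \mid \theta_A^{(m)}, A)\, k(\theta_A^{(m)} \mid A)}
\xrightarrow{a.s.}
\frac{\int_{\Theta_A} p(y^o \mid \theta_A, A)\, k(\theta_A \mid A)\, g(\theta_A)\, d\theta_A}{\int_{\Theta_A} p(y^o \mid \theta_A, A)\, k(\theta_A \mid A)\, d\theta_A} \tag{8.35}
$$

for all prior density kernels $k(\theta_A \mid A)$ in the density ratio class. As indicated in
Section 3.3, the problem of finding the lower bound $\underline{E}[g(\theta_A) \mid y^o, A]$ on the
posterior moment is the same as finding the upper bound $\overline{E}[g(\theta_A) \mid y^o, A]$, and so this
section concentrates on the latter question.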

To simplify the notation, let $g_m = g(\theta_A^{(m)})$ $(m = 1, \ldots, M)$, and suppose that the
simulator output is ordered so that $g_m$ is monotone nondecreasing. Let $\iota_M$ denote an
$M \times 1$ vector of ones and consider approximating $\overline{E}[g(\theta_A) \mid y^o, A]$ by maximizing

$$
Q(r) = g'r/\iota_M'r = \sum_{m=1}^{M} g_m r_m \Big/ \sum_{m=1}^{M} r_m
$$

subject to $u \le r \le v$, where $u_m = w(\theta_A^{(m)})\, a(\theta_A^{(m)})/k(\theta_A^{(m)} \mid S)$ and
$v_m = w(\theta_A^{(m)})\, b(\theta_A^{(m)})/k(\theta_A^{(m)} \mid S)$ $(m = 1, \ldots, M)$ and where $k(\theta_A \mid S)$ is the prior
density kernel used in the simulation. We first develop an efficient solution of this
discrete maximization problem, and then show that the solution provides a
simulation-consistent approximation of $\overline{E}[g(\theta_A) \mid y^o, A]$.

The solution of the discrete problem parallels the solution of the exact problem 
given in Theorem 3.3.1. 


Theorem 8.5.1 All solutions $Q_M$ of $\max_r Q(r)$ subject to $u \le r \le v$ are of
the form

$$
r = (u_1, \ldots, u_s, v_{s+1}, \ldots, v_M)', \tag{8.36}
$$

where $s$ has the property

$$
g_s \le Q_s \le g_{s+1} \tag{8.37}
$$

and

$$
Q_s = \frac{N_s}{D_s} = \frac{\sum_{m=1}^{s} g_m u_m + \sum_{m=s+1}^{M} g_m v_m}{\sum_{m=1}^{s} u_m + \sum_{m=s+1}^{M} v_m}. \tag{8.38}
$$

Proof: Because the sign of

$$
\frac{\partial Q}{\partial r_m} = \frac{\left(\sum_{j=1}^{M} r_j\right) g_m - \sum_{j=1}^{M} r_j g_j}{\left(\sum_{j=1}^{M} r_j\right)^2} = \frac{g_m - Q(r_1, \ldots, r_M)}{\sum_{j=1}^{M} r_j}
$$

does not involve $r_m$, a necessary condition for maximization is $r_m = u_m$ if $g_m <
Q(r)$ and $r_m = v_m$ if $g_m > Q(r)$. Hence all solutions $Q_M$ are characterized by
(8.36)-(8.38). ∎


Some relations between $\{Q_s\}$ and $\{g_s\}$ are useful in computing $\max_r Q(r)$ and
in showing that $Q_M \xrightarrow{a.s.} \overline{E}[g(\theta_A) \mid y^o, A]$.


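Theorem 8.5.2 For $Q_1, \ldots, Q_M$ defined in (8.38), and any $s$, we obtain

(a) $Q_s < g_s \iff Q_s < Q_{s-1} \implies Q_{s+1} < Q_s$
(b) $Q_s > g_{s+1} \iff Q_{s+1} > Q_s \implies Q_s > Q_{s-1}$
(c) $Q_s = Q_{s+1} \iff g_s \le Q_s \le g_{s+1} \le Q_{s+1} \le g_{s+2}$

Proof: Simple algebra shows

$$
Q_{s+1} - Q_s = (Q_s - g_{s+1})(v_{s+1} - u_{s+1})/D_{s+1} \tag{8.39}
$$

$$
\phantom{Q_{s+1} - Q_s} = (Q_{s+1} - g_{s+1})(v_{s+1} - u_{s+1})/D_s. \tag{8.40}
$$

Because $a(\theta_A) < b(\theta_A)\ \forall\, \theta_A \in \Theta_A$, it follows that $v_s - u_s > 0\ \forall\, s = 1, \ldots, M$.

(a) From (8.40), $Q_s < g_s \iff Q_s < Q_{s-1}$; from (8.39), $Q_s < g_s \le g_{s+1} \implies
Q_{s+1} < Q_s$.

(b) From (8.39), $Q_s > g_{s+1} \iff Q_{s+1} > Q_s$; from (8.40), $Q_s > g_{s+1} \ge g_s \implies
Q_s > Q_{s-1}$.

(c) From (8.39) and (8.40), $Q_s = Q_{s+1} \implies Q_s = g_{s+1} = Q_{s+1}$, and $g_s \le
g_{s+1} \le g_{s+2}$; from (8.40), $g_{s+1} \le Q_{s+1} \implies Q_{s+1} \ge Q_s$, and from (8.39),
$Q_s \le g_{s+1} \implies Q_{s+1} \le Q_s$. ∎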


Theorem 8.5.2 shows that (8.38) and (8.37) are sufficient for $Q_s = Q_M$. It also
provides a computationally efficient solution of the discrete problem.


Corollary 8.5.1 $Q_M = \max_r Q(r)$ subject to $u \le r \le v$ may be computed as
follows:

1. Sort $g(\theta_A^{(1)}), \ldots, g(\theta_A^{(M)})$ so that $g_m = g(\theta_A^{(m)})$ is monotone nondecreasing
in $m$.

2. Using successive bisection (Press et al. 1992, Section 3.4), find an index $s$
such that $g_s \le Q_s \le g_{s+1}$.

3. Set $Q_M = Q_s$.


The sufficiency result in Theorem 8.5.2 means that no further search is required
once the condition in step 2 of Corollary 8.5.1 is satisfied. Computation time for
step 2 is of the order $\log_2(M)$, whereas computation time for step 1 is of the order
$M \log_2(M)$ (Press et al. 1992, Section 8.3). Step 1 is essential to any approach, so
as $M \to \infty$, Corollary 8.5.1 provides the most efficient algorithm possible.
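
The three steps of Corollary 8.5.1 can be sketched as below. This is an illustrative implementation under stated assumptions, not BACC's code: the function names are hypothetical, prefix sums are used so that each $Q_s$ is available in constant time during the bisection, and the convention $s = 0$ (all $r_m = v_m$) and $s = M$ (all $r_m = u_m$) is added to cover the end cases. The lower bound is obtained by applying the same routine to $-g$. The brute-force comparison relies on the fact that a ratio of linear functions attains its extremes over a box at a vertex.

```python
import itertools

import numpy as np


def density_ratio_upper_bound(g, u, v):
    """Maximize sum(g * r) / sum(r) subject to u <= r <= v (0 < u < v)."""
    g, u, v = (np.asarray(x, dtype=float) for x in (g, u, v))
    order = np.argsort(g, kind="stable")          # step 1: sort by g
    g, u, v = g[order], u[order], v[order]
    M = g.size

    # Prefix sums with a leading zero, so entry s sums the first s terms.
    cu = np.concatenate(([0.0], np.cumsum(u)))
    cv = np.concatenate(([0.0], np.cumsum(v)))
    cgu = np.concatenate(([0.0], np.cumsum(g * u)))
    cgv = np.concatenate(([0.0], np.cumsum(g * v)))

    def Q(s):
        # r = (u_1, ..., u_s, v_{s+1}, ..., v_M), as in (8.36) and (8.38)
        num = cgu[s] + (cgv[M] - cgv[s])
        den = cu[s] + (cv[M] - cv[s])
        return num / den

    # Step 2: bisection for the smallest s with Q_s <= g_{s+1}.
    # In zero-based storage g_{s+1} is g[s]; by Theorem 8.5.2(b),
    # Q_s > g_{s+1} means Q is still increasing in s.
    lo, hi = 0, M
    while lo < hi:
        mid = (lo + hi) // 2
        if Q(mid) > g[mid]:
            lo = mid + 1
        else:
            hi = mid
    return float(Q(lo))                            # step 3


def density_ratio_lower_bound(g, u, v):
    """Minimize the same ratio by maximizing it for -g."""
    return -density_ratio_upper_bound(-np.asarray(g, dtype=float), u, v)


# Small check against exhaustive enumeration of the 2^M box vertices,
# using the special case a = k / r, b = r * k with r = 2.
rng = np.random.default_rng(1)
g_demo = rng.normal(size=10)
w_demo = rng.uniform(0.5, 1.5, size=10)
u_demo, v_demo = w_demo / 2.0, w_demo * 2.0
ratios = [np.dot(g_demo, r) / np.sum(r)
          for r in itertools.product(*zip(u_demo, v_demo))]
upper = density_ratio_upper_bound(g_demo, u_demo, v_demo)
lower = density_ratio_lower_bound(g_demo, u_demo, v_demo)
print(lower, upper, min(ratios), max(ratios))
```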


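Theorem 8.5.3 Suppose that the convergence condition (8.35) is true for all
prior density kernels in the density ratio class

$$
\{k(\theta_A \mid A):\ 0 < a(\theta_A) \le k(\theta_A \mid A) \le b(\theta_A)\ \forall\, \theta_A \in \Theta_A\}
$$

and define $Q_M$ as in Corollary 8.5.1. Then

$$
Q_M \xrightarrow{a.s.} \overline{E}[g(\theta_A) \mid y^o, A].
$$

Proof: For any real $c$, define $k(\theta_A; c)$ as in Theorem 3.3.1 and let $s = s(M)$ be
the integer $s$ of Corollary 8.5.1. Let

$$
s_1 = s_1(M) = \max\{m : g_m < \overline{E}[g(\theta_A) \mid y^o, A]\}.
$$

Then $Q_{s_1} \xrightarrow{a.s.} \overline{E}[g(\theta_A) \mid y^o, A]$ by assumption.

For any $c < \overline{E}[g(\theta_A) \mid y^o, A]$, let $s_2 = s_2(M) = \max\{m : g_m < c\}$. By assumption
and Theorem 3.3.1, we obtain

$$
Q_{s_2} \xrightarrow{a.s.} \frac{\int_{\Theta_A} p(y^o \mid \theta_A, A)\, k(\theta_A; c)\, g(\theta_A)\, d\nu(\theta_A)}{\int_{\Theta_A} p(y^o \mid \theta_A, A)\, k(\theta_A; c)\, d\nu(\theta_A)} < \overline{E}[g(\theta_A) \mid y^o, A].
$$

Hence $\lim_{M \to \infty} P[s_1(M) > s_2(M)] = 1$. By the monotonicity properties of $\{Q_s\}$
established in Theorem 8.5.2, $\lim_{M \to \infty} P[s(M) > s_2(M)] = 1$.

A symmetric argument holds beginning with any $c > \overline{E}[g(\theta_A) \mid y^o, A]$. ∎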


Theorem 8.5.3 applies to any posterior simulation from a model with observables
density $p(y^o \mid \theta_A, A)$ and prior density kernel $k(\theta_A \mid A)$. It does not require
$a(\theta_A) \le k(\theta_A \mid A) \le b(\theta_A)$, but in general, bounds of this type will be of interest
to the client. BACC incorporates the yet more specific instance $a(\theta_A) = r^{-1} \cdot
k(\theta_A \mid A)$, $b(\theta_A) = r \cdot k(\theta_A \mid A)$, illustrated in Example 5.1.2. The user specifies
the function of interest vector $g(\theta_A^{(m)})$, the weight vector $w(\theta_A^{(m)})$ $(m = 1, \ldots, M)$,
and one or more values of $r > 1$. BACC computes the lower and upper bounds
$\underline{E}[g(\theta_A) \mid y^o, A]$ and $\overline{E}[g(\theta_A) \mid y^o, A]$ corresponding to each value of $r$, using the
algorithm described in Corollary 8.5.1.


Exercise 8.5.1 Completing the Argument Derive (8.39) and (8.40). 